Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-16 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  closed
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:  fixed
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Replying to [comment:17 irl]:
 > Is this code released now, if so we should update metrics-cloud with
 this version ready for deployment in January. A second backend host is
 requested in #32763.

 Yes, this code is released, see #32684. Deploying this version to two
 backends in January sounds fine to me!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-16 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  closed
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:  fixed
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by irl):

 Is this code released now, if so we should update metrics-cloud with this
 version ready for deployment in January. A second backend host is
 requested in #32763.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-08 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  closed
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:  fixed
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Thanks for checking, and glad to hear that things worked out this well!
 We'll take care of coordinating new backend instances. Thanks again!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-06 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  closed
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:  fixed
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * status:  merge_ready => closed
 * resolution:   => fixed


Old description:

> hello!
>
> today i noticed that, since last friday (UTC) morning, there has been
> pretty big spikes on the internal network between the ganeti nodes, every
> hour. it looks like this, in grafana:
>
> [[Image(snap-2019.12.02-16.06.11.png​)]]
>
> We can clearly see a correlation between the two node's traffic, in
> reverse. This was confirmed using `iftop` and `tcpdump` on the nodes
> during a surge.
>
> It seems this is due to onionoo-backend-01 blasting the disk and CPU for
> some reason. This is the disk I/O graphs for that host, which correlate
> pretty cleanly with the above graphs:
>
> [[Image(snap-2019.12.02-16.30.33.png​)]]
>
> This was confirmed by an inspection of `drbd`, the mechanisms that
> synchronizes the disks across the network. It seems there's a huge surge
> of "writes" on the network every hour which lasts anywhere between 20 and
> 30 minutes. This was (somewhat) confirmed by running:
>
> {{{
> watch -n 0.1 -d cat /proc/drbd
> }}}
>
> on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in
> DRBD. 13 and 17 are the web nodes, so that's expected - probably log
> writes? But device ID 4 is onionoo-backend, which is what led me to the
> big traffic graph.
>
> could someone from metrics investigate?
>
> can i just turn off this machine altogether, considering it's basically
> trying to murder the cluster every hour? :)

New description:

 hello!

 today i noticed that, since last friday (UTC) morning, there has been
 pretty big spikes on the internal network between the ganeti nodes, every
 hour. it looks like this, in grafana:

 [[Image(snap-2019.12.02-16.06.11.png, 700​)]]

 We can clearly see a correlation between the two node's traffic, in
 reverse. This was confirmed using `iftop` and `tcpdump` on the nodes
 during a surge.

 It seems this is due to onionoo-backend-01 blasting the disk and CPU for
 some reason. This is the disk I/O graphs for that host, which correlate
 pretty cleanly with the above graphs:

 [[Image(snap-2019.12.02-16.30.33.png​)]]

 This was confirmed by an inspection of `drbd`, the mechanisms that
 synchronizes the disks across the network. It seems there's a huge surge
 of "writes" on the network every hour which lasts anywhere between 20 and
 30 minutes. This was (somewhat) confirmed by running:

 {{{
 watch -n 0.1 -d cat /proc/drbd
 }}}

 on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in
 DRBD. 13 and 17 are the web nodes, so that's expected - probably log
 writes? But device ID 4 is onionoo-backend, which is what led me to the
 big traffic graph.

 could someone from metrics investigate?

 can i just turn off this machine altogether, considering it's basically
 trying to murder the cluster every hour? :)

--

Comment:

 wow, that *is* a huge improvement! check this out:

 https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-
 metrics?orgId=1=1575563766753=1575650166753
 node=omeiense.torproject.org:9100=oo-hetzner-03.torproject.org

 in particular:

 [[Image(snap-2019.12.06-11.36.33.png, 700)]]

 large reduction in CPU and memory usage, significant reduction in load!

 [[Image(snap-2019.12.06-11.43.28.png, 700)]]

 also a *dramatic* reduction in disk utilization! especially: all that
 writing was significantly reduced... but what i find the most interesting
 is this:

 [[Image(snap-2019.12.06-11.49.27.png, 700)]]

 ie. we write less, but we don't read more! even though we're computing all
 those checksums, we don't impose extra load on the disks because of that
 reading, which is one thing I was worried about.

 but even if we would read more (which we don't) it would still be a
 worthwhile tradeoffs because (1) we can cache those and (2) we (obviously)
 don't need to replicate reads across the cluster.

 i can't confirm the effect on the actual ganeti cluster because irl
 (thankfully! :) has turned off those jobs on onionoo-backend-01. but i'm
 not confident the cluster will be happier with this work if/when we turn
 it back on.

 thank you so much for taking the extra time in fixing this and taking care
 of our hardware. sometimes it's easier to throw hardware at a problem, but
 this seemed like a case where we could improve our algos a little, and I'm
 glad it worked out. :)

 all in all, i think this can be marked as fixed, at least it is 

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-06 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  merge_ready
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * Attachment "snap-2019.12.06-11.49.27.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-06 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  merge_ready
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * Attachment "snap-2019.12.06-11.43.28.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-06 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  merge_ready
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * Attachment "snap-2019.12.06-11.38.52.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-06 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  merge_ready
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * Attachment "snap-2019.12.06-11.36.33.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-06 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  merge_ready
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Patch is merged, released, and deployed on both backends. Hourly updates
 are much faster now. Does this show in the graphs?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-05 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  merge_ready
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Oh, I'm not concerned about increase in CPU load. I used the same code to
 compute SHA-256 file digests for #31204 where doing this for 643M of data
 took merely 5 seconds.

 Other mechanisms like timestamps don't work here, because we're actually
 producing new content for these files and only then learn that previously
 written files already have the same content.

 We could do something smarter by detecting whether something has changed
 to the underlying data that we're formatting and writing, but that
 requires more development time that we currently have available.

 I'll deploy this change now, and then we can see how much that bought us.
 If this doesn't fully solve the issue, which might well be the case, we
 can think more about that discrepancy between the two numbers.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-05 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  merge_ready
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by anarcat):

 >  I don't know enough about XFS to know if filesystem attributes can be
 trusted.

 I would be very surprised if they can't, unless you're doing something
 funky with those. That said, filesystems, in general, can hardly be
 trusted :p

 >  The change looks good if we decide we want to go this route. The JVM
 should use the Intel SHA extensions on recent processors, I don't think
 that we're going to see a huge increase in CPU load.

 Okay.

 What do you think about the discrepancy between the estimated writes
 onionoo self-reports and what Prometheus collects?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-05 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  merge_ready
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by irl):

 * status:  needs_review => merge_ready


Comment:

 Replying to [comment:10 anarcat]:
 > before we start doing optimization tradeoffs, especially ones involving
 more CPU time, I think we should  diagnose the problem further. running
 checksums can take a while and might impose extra load on the server...
 couldn't we use timestamps or some faster mechanism?

 I don't know enough about XFS to know if filesystem attributes can be
 trusted.

 Replying to [comment:9 karsten]:
 > irl, please review
 
[https://gitweb.torproject.org/user/karsten/onionoo.git/commit/?h=task-32660=4b7b13d37d2423f9b1dee865e203a524b8df488b
 commit 4b7b13d in my task-32660 branch].

 The change looks good if we decide we want to go this route. The JVM
 should use the Intel SHA extensions on recent processors, I don't think
 that we're going to see a huge increase in CPU load.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-05 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by anarcat):

 before we start doing optimization tradeoffs, especially ones involving
 more CPU time, I think we should  diagnose the problem further. running
 checksums can take a while and might impose extra load on the server...
 couldn't we use timestamps or some faster mechanism?

 we should also diagnose why our numbers differ. there might be another
 process writing all that data in there...

 someone(tm) should run iotop during the peak periods to figure that out.
 :) i'll see if i can do that soon.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-05 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  needs_review
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by karsten):

 * status:  new => needs_review


Comment:

 irl, please review
 
[https://gitweb.torproject.org/user/karsten/onionoo.git/commit/?h=task-32660=4b7b13d37d2423f9b1dee865e203a524b8df488b
 commit 4b7b13d in my task-32660 branch].

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-05 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Here's an attempt to reduce writes:

 {{{
 2019-12-05 09:45:45,384 INFO o.t.m.o.cron.Main:216 Document store
 7 list operations performed
 951,149 files listed
 56,448 files stored
 1.4 GiB stored
 912,471 files not rewritten   <--
 641.3 MiB not rewritten   <--
 983,247 files retrieved
 2.8 GiB retrieved
 0 files removed
 }}}

 The idea is to read existing files before writing them, compute their
 digests, compute digests of the content to be written, and then abort the
 write if nothing has changed.

 It looks like this saves us from overwriting a lot of unchanged files. It
 doesn't save as many bytes, but that might be secondary. Of course, this
 patch comes at the price of doing more reads and digest computations. It
 might be worth it, just saying that it's not free.

 I'll bring this patch up at today's team meeting. We should discuss this
 and possibly merge it before doing new deployments on `onionoo-
 backend-0[12]`.

 And then we could look at those graphs once more to see if we need to
 tweak things more. Thanks for making these graphs!

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-04 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * Attachment "snap-2019.12.04-17.23.36.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-04 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by anarcat):

 i had trouble regenerating the report that gave me the 50GiB figure today,
 so here's a more direct link:

 https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-
 metrics?orgId=1=omeiense.torproject.org:9100=onionoo-
 backend-01.torproject.org:9100=oo-
 hetzner-03.torproject.org:9100=157532880=157533180

 and here's a screenshot

 [[Image(snap-2019.12.04-17.14.49.png, 700)]]

 Here you can clearly see all three servers (from left to right, omeiense,
 onionoo-backend-01, oo-hetzner-03) almost all maxing their disks, for a
 significant amount of time. the older backends (omeiense and hetzner) can
 barely make it in time for the next job: they both took 47 minutes to
 write. the new backend is faster, and makes it in a little over 20
 minutes, but they all take up more than 50% of disk utilization, up to
 100% for the right one. they write between 10 and 40MiB/s if I read those
 graphs right (and if we can trust those stats).

 i'm still learning how to do Prometheus queries, so maybe i don't do this
 right, but this query:

 {{{
 
increase(node_disk_written_bytes_total{instance=~'omeiense\\.torproject\\.org:9100
 |onionoo-backend-01\\.torproject\\.org:9100|oo-
 hetzner-03\\.torproject\\.org:9100'}[1h])
 }}}

 seems to say the servers write between 35GB (omeinse) and 58GB (hetzner)
 every hour:

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-04 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * Attachment "snap-2019.12.04-17.14.49.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-04 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by anarcat):

 okay, maybe there's something wrong with my math then... i'll go back to
 my homework, but i really wonder where that discrepancy is coming from...
 could there be other writes that are not covered by those stats?

 we don't seem to be hitting the swap file, so that shouldn't be it...

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-04 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Nah, we're not writing 50 GiB per hour. Here are some statistics:

 {{{
 2019-12-04 10:06:30,905 INFO o.t.m.o.cron.Main:216 Document store
 7 list operations performed
 1,859,665 files listed
 1,867,140 files stored
 2.0 GiB stored
 1,874,298 files retrieved
 2.5 GiB retrieved
 0 files removed
 }}}

 I'm currently looking into avoiding some of those writes by comparing file
 digests before writing. Will post results here once I have them.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-03 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by anarcat):

 >  Let's first find out what's happening there. We were planning to stop
 this instance this afternoon and set up a new one on the same host. If we
 don't know what's going wrong, we might see the same issue with the new
 instance.

 > So, this seems like something that is caused by the hourly updater. Can
 you tell us if omeiense and/or oo-hetzner-03 have similar loads at roughly
 the same timing?

 They do look similar, now that you mention it:

 https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-
 metrics?orgId=1=omeiense.torproject.org:9100=onionoo-
 backend-01.torproject.org:9100=oo-hetzner-03.torproject.org:9100

 The peak bandwidth usage is bigger on onionoo-backend-01, but that might
 be just because the disks are faster; the peak is bigger, but shorter, so
 the transfer size is probably equivalent.

 > I have suspended the hourly updater on this host, but this is the normal
 expected operation for Onionoo's hourly updater. There are state files
 that get updated every run including those for which nothing has changed.
 This is a known flaw in Onionoo but until now it hasn't been a problem.

 Okay, maybe I'm being overly cautious then. It might be a problem in the
 mid-long term in the new cluster because of the way it is structured: all
 writes are replicated between the two servers, because they act as a
 redundant cluster. If one node goes down, the other can take over on the
 fly. It also allows us to migrate the machines between the two servers
 more or less in realtime.

 Maybe we should make an exception for this host, and keep the data only on
 one server. This would have a few implications:

  1. if the server goes down, we need to restore from backups, so there's
 up to 24h of data loss
  2. when we reboot the servers, the machine will go down for the duration
 of the reboot
  3. moving the machine around if we decommission the server will require
 manual work

 > karsten is going to look at how difficult it would be to reduce the
 number of writes performed. Is the problem total IO or is it just the
 writes? Are reads cached? i.e. if we read it again to compare before
 writing, does that help?

 A napkin calculation tells me we're writing about 50GiB of data on the
 disk every hour. That seems like a *lot*!

 (I base this on the graphs that seem to average about 36MiB/s for 35
 minutes on onionoo-backend-01, which means around 56GiB. oo-hetzner-03
 writes 21MiB for 35 minutes, which means about 46GiB. about average
 between the two is 50GiB.)

 Is that about right? What *are* you writing in there? :)

 > There's very little we can do about the CPU load. We already use
 optimized parsing libraries for JSON, and quite simple parsers for Tor
 descriptors. Metrics does involve some computation. If CPU load is a
 problem then perhaps the Ganeti cluster is the wrong place for Onionoo to
 live and we need something else.

 I don't mind the CPU load so much, actually. That we have some capacity.
 And we do have the capacity on the network too - it's a gigabit link after
 all. It's just that this single node is already taking 10% of the capacity
 during those peaks, so I was worried it was an anomaly.

 But maybe there's much ado about nothing here. It just seems strange that
 we write all that data all the time...

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-03 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by irl):

 I have suspended the hourly updater on this host, but this is the normal
 expected operation for Onionoo's hourly updater. There are state files
 that get updated every run including those for which nothing has changed.
 This is a known flaw in Onionoo but until now it hasn't been a problem.

 karsten is going to look at how difficult it would be to reduce the number
 of writes performed. Is the problem total IO or is it just the writes? Are
 reads cached? i.e. if we read it again to compare before writing, does
 that help?

 There's very little we can do about the CPU load. We already use optimized
 parsing libraries for JSON, and quite simple parsers for Tor descriptors.
 Metrics does involve some computation. If CPU load is a problem then
 perhaps the Ganeti cluster is the wrong place for Onionoo to live and we
 need something else.

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-02 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Comment (by karsten):

 Let's first find out what's happening there. We were planning to stop this
 instance this afternoon and set up a new one on the same host. If we don't
 know what's going wrong, we might see the same issue with the new
 instance.

 So, this seems like something that is caused by the hourly updater. Can
 you tell us if omeiense and/or oo-hetzner-03 have similar loads at roughly
 the same timing?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-02 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--

Old description:

> hello!
>
> today i noticed that, since last friday (UTC) morning, there has been
> pretty big spikes on the internal network between the ganeti nodes, every
> hour. it seems this is due to onionoo-backend-01 blasting the disk and
> CPU for some reason.
>
> could someone from metrics investigate? can i just turn off this machine
> altogether, considering it's basically trying to murder the cluster every
> hour? :)
>
> (will attach explanatory screenshots)

New description:

 hello!

 today i noticed that, since last friday (UTC) morning, there has been
 pretty big spikes on the internal network between the ganeti nodes, every
 hour. it looks like this, in grafana:

 [[Image(snap-2019.12.02-16.06.11.png​)]]

 We can clearly see a correlation between the two node's traffic, in
 reverse. This was confirmed using `iftop` and `tcpdump` on the nodes
 during a surge.

 It seems this is due to onionoo-backend-01 blasting the disk and CPU for
 some reason. This is the disk I/O graphs for that host, which correlate
 pretty cleanly with the above graphs:

 [[Image(snap-2019.12.02-16.30.33.png​)]]

 This was confirmed by an inspection of `drbd`, the mechanisms that
 synchronizes the disks across the network. It seems there's a huge surge
 of "writes" on the network every hour which lasts anywhere between 20 and
 30 minutes. This was (somewhat) confirmed by running:

 {{{
 watch -n 0.1 -d cat /proc/drbd
 }}}

 on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in
 DRBD. 13 and 17 are the web nodes, so that's expected - probably log
 writes? But device ID 4 is onionoo-backend, which is what led me to the
 big traffic graph.

 could someone from metrics investigate?

 can i just turn off this machine altogether, considering it's basically
 trying to murder the cluster every hour? :)

--

Comment (by anarcat):

 attach screenshots and further explanations.

 the TL;DR: here is: can i shutdown this backend?

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-02 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * Attachment "snap-2019.12.02-16.30.33.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

Re: [tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-02 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   | Resolution:
 Keywords:   |  Actual Points:
Parent ID:   | Points:
 Reviewer:   |Sponsor:
-+--
Changes (by anarcat):

 * Attachment "snap-2019.12.02-16.06.11.png" added.


--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs

[tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

2019-12-02 Thread Tor Bug Tracker & Wiki
#32660: onionoo-backend is killing the ganeti cluster
-+--
 Reporter:  anarcat  |  Owner:  metrics-team
 Type:  defect   | Status:  new
 Priority:  Medium   |  Milestone:
Component:  Metrics/Onionoo  |Version:
 Severity:  Normal   |   Keywords:
Actual Points:   |  Parent ID:
   Points:   |   Reviewer:
  Sponsor:   |
-+--
 hello!

 today i noticed that, since last friday (UTC) morning, there has been
 pretty big spikes on the internal network between the ganeti nodes, every
 hour. it seems this is due to onionoo-backend-01 blasting the disk and CPU
 for some reason.

 could someone from metrics investigate? can i just turn off this machine
 altogether, considering it's basically trying to murder the cluster every
 hour? :)

 (will attach explanatory screenshots)

--
Ticket URL: 
Tor Bug Tracker & Wiki 
The Tor Project: anonymity online
___
tor-bugs mailing list
tor-bugs@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-bugs