Re: [Sks-devel] SKS Performance oddity

2019-03-09 Thread Jeremy T. Bouse


On 3/9/2019 5:29 AM, Michiel van Baak wrote:
> 
> Hey,
> 
> I have exactly the same problem.
> Several times in the last month I have done the following steps:
> 
> - Stop all nodes
> - Destroy the datasets (both db and ptree)
> - Load in a new dump that is at most 2 days old
> - Create the ptree database
> - Start sks on the primary node, without peering configured (comment out
>   all peers)
> - Give it some time to start
> - Check the stats page and run a couple of searches
> # Up until here everything works fine #
> - Add the outside peers on the primary node and restart it
> - After 5 minutes the machine takes 100% CPU, is stuck in I/O most of
>   the time and falls off the grid
> 
> It doesn't matter if I enable peering with the internal nodes or not.
> Just having 1 SKS instance running, and peering it with the network is
> enough to basically render this instance unusable.
> 
> Like you, I tried in a vm first, and also on a physical machine (dual
> 6-core xeon E5-2620 0 @ 2.00GHz with 96GB ram and 2 samsung evo 840 pro
> ssds for storage)
> I see exactly the same every time I follow the steps outlined above.
> 
> The systems I tried are Debian Linux and FreeBSD, and all behave the same.
> 

I've been trying to narrow it down and zero in on something to fix it,
though I admittedly don't know that much about the internal functions of
the process flow. I have noticed that the issue is not the recon service
itself, even though it shows up so blatantly during recon. From my
observation it actually appears to be the DB service.

At this point I have 5 nodes: sks01 - sks04 are my original 4 VM nodes,
all with 2 vCPU/4GB RAM except sks01 which has 4 vCPU/8GB, and sks0 is
my physical server with a 4-core Xeon and 4GB RAM. Currently sks0 is set
up as my external peering point; originally it was sks01. I have just
finished re-importing the keydump into sks0 and sks01 from the daily
dumps from mattrude.com for 2019-03-08 and 2019-03-09 respectively.

I'm running the following command from another machine to check on things:

for I in $(seq 50 54); do
  echo .${I}
  ssh 172.16.20.${I} 'uptime; ps aux | grep sks | grep -v grep; time curl -sf localhost:11371/pks/lookup?op=stats | grep keys:'
  echo
done

.50
 18:14:26 up 1 day, 11:30,  7 users,  load average: 0.10, 0.69, 1.31
debian-+ 24595 17.5 13.5 605012 540968 ?   Ss   15:32  28:32
/usr/sbin/sks -stdoutlog db
debian-+ 24596  0.3  0.8  72528 32740 ?Ss   15:32   0:37
/usr/sbin/sks -stdoutlog recon
Total number of keys: 5448526

real    0m0.014s
user    0m0.004s
sys     0m0.004s

.51
 18:14:28 up 1 day, 14:03,  4 users,  load average: 1.30, 1.65, 1.49
debian-+  5166 32.4 36.0 3059044 2950716 ? Ss   15:37  51:01
/usr/sbin/sks -stdoutlog db
debian-+  5167  0.5  4.0 603644 331260 ?   Ss   15:37   0:48
/usr/sbin/sks -stdoutlog recon
Total number of keys: 5448005

real    0m0.022s
user    0m0.012s
sys     0m0.000s

.52
 18:14:30 up 7 days, 19:21,  4 users,  load average: 0.98, 0.38, 0.31
debian-+  6234  0.5 38.6 1609044 1565612 ? Rs   Mar06  30:33
/usr/sbin/sks -stdoutlog db
debian-+  6235  0.0  3.8 356328 156708 ?   Ss   Mar06   0:51
/usr/sbin/sks -stdoutlog recon
Total number of keys: 5447149

real    1m46.269s
user    0m0.012s
sys     0m0.000s

.53
 18:16:17 up 7 days, 19:28,  4 users,  load average: 2.01, 1.55, 0.85
debian-+  5754  0.6 13.6 590840 551360 ?   Ds   Mar05  37:20
/usr/sbin/sks -stdoutlog db
debian-+  5755  0.0  3.1 266908 126064 ?   Ss   Mar05   1:59
/usr/sbin/sks -stdoutlog recon
Total number of keys: 5447523

real    0m46.400s
user    0m0.008s
sys     0m0.004s

.54
 18:17:05 up 7 days, 19:28,  4 users,  load average: 1.88, 0.87, 0.41
debian-+  5994  0.6 18.5 791456 752596 ?   Ss   Mar05  35:24
/usr/sbin/sks -stdoutlog db
debian-+  5995  0.0  3.0 260224 122112 ?   Ds   Mar05   1:45
/usr/sbin/sks -stdoutlog recon
Total number of keys: 5447788

real    0m0.015s
user    0m0.008s
sys     0m0.000s


For stability's sake I've removed sks0 and sks01 from my NGINX upstreams;
the one exception is that I have

location /pks/hashquery {
proxy_method POST;
proxy_pass http://127.0.0.1:11371;
}

so that /pks/hashquery doesn't use the server pool but the local SKS
instance. So sks0 only sees all traffic to 11370/tcp, plus traffic for
the /pks/hashquery URI on 11371/tcp. All other /pks URI requests are
going through the backend and hitting sks02 - sks04.
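
To make the routing concrete, the overall NGINX layout being described
looks roughly like the sketch below. The upstream name, the 172.16.20.x
addresses for sks02 - sks04 and the assumption that every SKS instance
listens on port 11371 are illustrative, not taken from the actual config:

upstream sks_pool {
    # sks0 (.50) and sks01 (.51) pulled out of the pool for now
    server 172.16.20.52:11371;
    server 172.16.20.53:11371;
    server 172.16.20.54:11371;
}

server {
    # NGINX on the outward-facing address; the local SKS instance is
    # assumed to be bound to 127.0.0.1:11371
    listen 172.16.20.50:11371;

    location /pks/hashquery {
        proxy_method POST;
        proxy_pass http://127.0.0.1:11371;   # hashquery stays on the local instance
    }

    location /pks {
        proxy_pass http://sks_pool;          # everything else hits sks02 - sks04
    }
}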

I have found some improvement with changes to the *pagesize settings
before re-importing the keydump. Currently all my nodes have had their
data re-imported using the following settings:

pagesize:  128
keyid_pagesize:64
meta_pagesize: 1
subkeyid_pagesize: 128
time_pagesize: 128
tqueue_pagesize:   1
ptree_pagesize:8
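
These *pagesize values live in sksconf and only take effect when the
databases are rebuilt from a dump, which is why the re-import matters. A
minimal sketch of such a rebuild (the dump path, cache sizes and the
debian-sks user are illustrative, and the KDB/PTree directory names may
differ by version/packaging):

systemctl stop sks
cd /var/lib/sks
rm -rf KDB PTree                        # wipe the key database and prefix tree
sks build dump/*.pgp -n 10 -cache 100   # rebuild the key DB from the keydump files
sks pbuild -cache 20 -ptree_cache 70    # rebuild the prefix tree used by recon
chown -R debian-sks: .                  # Debian's sks package runs as debian-sks
systemctl start sks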

I also have the hack mentioned on the list to short-circuit the
bad-actor keys, using:

if ( $arg_search ~*
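
(The snippet above is cut off in the archive. For illustration only,
that kind of short-circuit generally takes the shape below, with
placeholder key IDs standing in for the real list, the status code
chosen arbitrarily, and the upstream name assumed from the sketch
earlier in this message:)

location /pks/lookup {
    # placeholder key IDs -- the real list from the thread was truncated
    if ($arg_search ~* "(0xDEADBEEFDEADBEEF|0xCAFEBABECAFEBABE)") {
        return 444;   # or 403: refuse to pass these searches on to SKS
    }
    proxy_pass http://sks_pool;   # normal lookup backend
}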

Re: [Sks-devel] SKS Performance oddity

2019-03-09 Thread Jim Popovitch

On Sat, 2019-03-09 at 00:22 -0500, Jeremy T. Bouse wrote:
>   I've been fighting with this for several days now... Anyone else
> out there seeing this behavior? If not, would anyone with similarly resourced
> servers care to share details to see if I'm missing something here.
> 
>   The particulars are that all nodes are Debian 9.8 (Stretch) 64-bit.


I'm running a near-identical setup of SKS (but a single instance) and I too am
seeing the same behaviour.  A fresh import from Matt's daily data yields a
nice experience.  Once the service is active and online, the data quickly
(within hours) becomes corrupted with bogus, invalid, unsanitized UID data
(which is putting it nicely; it's not really UID data).

-Jim P.




Re: [Sks-devel] SKS Performance oddity

2019-03-09 Thread Michiel van Baak
On Sat, Mar 09, 2019 at 12:22:14AM -0500, Jeremy T. Bouse wrote:
>   I don't know what is going on here with my cluster but I have 3 of 4
> nodes that absolutely perform as I would expect... They have 2 vCPU
> with 4GB RAM each along with an extra 50GB drive exclusively for SKS
> use under /var/lib/sks. The three behaving fine are my sks02, sks03
> and sks04 secondary nodes. My primary node on the other hand is
> another story. First I tried increasing it from 2 vCPU/4GB RAM like
> the others to 2 vCPU/8GB RAM and then 4 vCPU/8GB RAM without it making
> any change. I then built out a new physical server with a quad-core
> Xeon 2.4GHz processor and 4GB RAM and a dedicated 3TB RAID5 array and
> I'm seeing the same problem. SKS is constantly pegging the CPU at 100%
> and eating up nearly all the memory whether it's running on a virtual
> or physical server. Recon service is working and I'm ingesting keys
> from peers and peering with my internal cluster nodes, but every time it
> goes into recon mode the node starts failing to respond as the CPU and
> RAM spike which then leads to the node being dropped from the pool as
> the stats page can't be hit before it times out.
> 
>   I've been fighting with this for several days now... Anyone else
> out there seeing this behavior? If not, would anyone with similarly resourced
> servers care to share details to see if I'm missing something here.
> 
>   The particulars are that all nodes are Debian 9.8 (Stretch) 64-bit.
> Only the primary node handles running NGINX configured for load
> balancing the cluster. The only other daemons running across all nodes
> besides SKS are OpenSSH for remote access, SSSD for centralized
> authentication, Haveged for entropy and Postfix configured for
> smarthost relaying.

Hey,

I have exactly the same problem.
Several times in the last month I have done the following steps:

- Stop all nodes
- Destroy the datasets (both db and ptree)
- Load in a new dump that is at most 2 days old
- Create the ptree database
- Start sks on the primary node, without peering configured (comment out
  all peers)
- Give it some time to start
- Check the stats page and run a couple of searches
# Up until here everything works fine #
- Add the outside peers on the primary node and restart it
- After 5 minutes the machine takes 100% CPU, is stuck in I/O most of
  the time and falls off the grid

It doesn't matter if I enable peering with the internal nodes or not.
Just having 1 SKS instance running, and peering it with the network is
enough to basically render this instance unusable.

Like you, I tried in a vm first, and also on a physical machine (dual
6-core xeon E5-2620 0 @ 2.00GHz with 96GB ram and 2 samsung evo 840 pro
ssds for storage)
I see exactly the same every time I follow the steps outlined above.

The systems I tried are Debian Linux and FreeBSD, and all behave the same.

-- 
Michiel van Baak
mich...@vanbaak.eu
GPG key: http://pgp.mit.edu/pks/lookup?op=get&search=0x6FFC75A2679ED069

NB: I have a new GPG key. Old one revoked and revoked key updated on keyservers.



Re: [Sks-devel] SKS Performance oddity

2019-03-09 Thread Todd Fleisher
I've been having similar issues this week, though in my case it's mainly high
I/O load/wait that is the problem. Also, it's not the primary nodes that recon
with the outside world that are affected, but some of my secondary nodes that
only peer internally. I've been restoring them by replacing the DB & PTree
files/dirs from another node, and that seems to do the trick for a period of
time, but I have already done it twice in the last few days so it's not really
a sustainable approach. I just haven't had time to dig deeper into it to try
and determine why it is happening and/or how to better protect against it.
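
A minimal sketch of that kind of restore, assuming the hostname, the
/var/lib/sks path, KDB/PTree directory names and the debian-sks user
(SKS should be stopped on both ends, or at least quiesced on the source,
while copying):

# on the node being repaired
systemctl stop sks
rsync -a --delete healthy-node:/var/lib/sks/KDB/   /var/lib/sks/KDB/
rsync -a --delete healthy-node:/var/lib/sks/PTree/ /var/lib/sks/PTree/
chown -R debian-sks: /var/lib/sks
systemctl start sks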

Sent from the Fleishphone

> On Mar 8, 2019, at 19:22, Jeremy T. Bouse  wrote:
> 
> 
>   I don't know what is going on here with my cluster but I have 3 of 4
> nodes that absolutely perform as I would expect... They have 2 vCPU
> with 4GB RAM each along with an extra 50GB drive exclusively for SKS
> use under /var/lib/sks. The three behaving fine are my sks02, sks03
> and sks04 secondary nodes. My primary node on the other hand is
> another story. First I tried increasing it from 2 vCPU/4GB RAM like
> the others to 2 vCPU/8GB RAM and then 4 vCPU/8GB RAM without it making
> any change. I then built out a new physical server with a quad-core
> Xeon 2.4GHz processor and 4GB RAM and a dedicated 3TB RAID5 array and
> I'm seeing the same problem. SKS is constantly pegging the CPU at 100%
> and eating up nearly all the memory whether it's running on a virtual
> or physical server. Recon service is working and I'm ingesting keys
> from peers and peering with my internal cluster nodes, but every time it
> goes into recon mode the node starts failing to respond as the CPU and
> RAM spike which then leads to the node being dropped from the pool as
> the stats page can't be hit before it times out.
> 
>   I've been fighting with this for several days now... Anyone else
> out there seeing this behavior? If not, would anyone with similarly resourced
> servers care to share details to see if I'm missing something here.
> 
>   The particulars are that all nodes are Debian 9.8 (Stretch) 64-bit.
> Only the primary node handles running NGINX configured for load
> balancing the cluster. The only other daemons running across all nodes
> besides SKS are OpenSSH for remote access, SSSD for centralized
> authentication, Haveged for entropy and Postfix configured for
> smarthost relaying.


___
Sks-devel mailing list
Sks-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/sks-devel