[Sks-devel] SKS Performance oddity
I don't know what is going on with my cluster, but 3 of my 4 nodes perform exactly as I would expect. They each have 2 vCPU and 4GB RAM, along with an extra 50GB drive used exclusively for SKS under /var/lib/sks. The three behaving fine are my sks02, sks03 and sks04 secondary nodes. My primary node, on the other hand, is another story. First I tried increasing it from 2 vCPU/4GB RAM like the others to 2 vCPU/8GB RAM, and then to 4 vCPU/8GB RAM, without any change. I then built out a new physical server with a quad-core 2.4GHz Xeon, 4GB RAM and a dedicated 3TB RAID5 array, and I'm seeing the same problem. SKS constantly pegs the CPU at 100% and eats up nearly all the memory, whether it's running on a virtual or a physical server. The recon service is working and I'm ingesting keys from peers and peering with my internal cluster nodes, but every time it goes into recon mode the node starts failing to respond as the CPU and RAM spike, which then leads to the node being dropped from the pool because the stats page can't be reached before it times out.

I've been fighting with this for several days now... Is anyone else out there seeing this behavior? If not, and you have similarly resourced servers, would you care to share details so I can see if I'm missing something?

The particulars: all nodes are Debian 9.8 (Stretch) 64-bit. Only the primary node runs NGINX, configured for load balancing the cluster. The only other daemons running across all nodes besides SKS are OpenSSH for remote access, SSSD for centralized authentication, Haveged for entropy, and Postfix configured for smarthost relaying.
_______________________________________________
Sks-devel mailing list
Sks-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/sks-devel
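[Editor's note: the NGINX load-balancing configuration Jeremy mentions is not shown in the thread. A minimal sketch of what such a front end could look like follows; the upstream name, hostnames, and addresses are assumptions, not taken from the post.]

```nginx
# Hypothetical sketch: NGINX on the primary node balancing HKP traffic
# across the secondary SKS nodes. All names and addresses are illustrative.
upstream sks_servers {
    server sks02:11371;
    server sks03:11371;
    server sks04:11371;
}

server {
    listen 11371;

    # Forward all HKP requests to the pool of secondaries.
    location /pks/ {
        proxy_pass http://sks_servers;
        proxy_set_header Host $host;
    }
}
```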
Re: [Sks-devel] SKS Performance oddity
I've been having similar issues this week, though in my case it's mainly high I/O load/wait that is the problem. Also, it's not my primary nodes that recon with the outside world that are affected, but some of my secondary nodes that only peer internally. I've been restoring them by replacing the DB & PTree files/dirs from another node, and that does the trick for a period of time, but I have already done it twice in the last few days, so it's not really a sustainable approach. I just haven't had time to dig deeper to determine why it is happening and/or how to better protect against it.

Sent from the Fleishphone

> On Mar 8, 2019, at 19:22, Jeremy T. Bouse wrote:
>
> I don't know what is going on with my cluster but I have 3 of 4
> nodes that absolutely perform as I would expect...
> [...]
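[Editor's note: the restore-by-copy approach described above (replacing the DB and PTree directories from a healthy node) could look roughly like the following. This is a sketch, not the poster's actual procedure; the hostname, paths, and service user are assumptions, and the script only prints the commands for review rather than executing anything.]

```shell
# Sketch: generate the commands to clone /var/lib/sks/{DB,PTree} from a
# healthy node (sks02, assumed) onto the broken one. Review before running.
SRC=sks02
SKSDIR=/var/lib/sks
cmds=$(cat <<EOF
systemctl stop sks
rsync -a --delete ${SRC}:${SKSDIR}/DB/ ${SKSDIR}/DB/
rsync -a --delete ${SRC}:${SKSDIR}/PTree/ ${SKSDIR}/PTree/
chown -R debian-sks: ${SKSDIR}
systemctl start sks
EOF
)
echo "$cmds"   # review first; pipe to sh on the broken node to execute
```

Note that sks must also be stopped on the source node (or the copy taken from a quiesced snapshot), since Berkeley DB files copied from a live database may be inconsistent.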
Re: [Sks-devel] SKS Performance oddity
On Sat, Mar 09, 2019 at 12:22:14AM -0500, Jeremy T. Bouse wrote:
> I don't know what is going on with my cluster but I have 3 of 4
> nodes that absolutely perform as I would expect...
> [...]

Hey,

I have exactly the same problem.
Several times in the last month I have done the following steps:

- Stop all nodes
- Destroy the datasets (both db and ptree)
- Load in a new dump at most 2 days old
- Create the ptree database
- Start sks on the primary node, without peering configured (comment out all peers)
- Give it some time to start
- Check the stats page and run a couple of searches
# Up until here everything works fine #
- Add the outside peers on the primary node and restart it
- After 5 minutes the machine takes 100% CPU, is stuck in I/O most of the time and falls off the grid

It doesn't matter if I enable peering with the internal nodes or not. Just having one SKS instance running and peering it with the network is enough to basically render the instance unusable.

Like you, I tried in a VM first, and also on a physical machine (dual 6-core Xeon E5-2620 0 @ 2.00GHz with 96GB RAM and 2 Samsung EVO 840 Pro SSDs for storage). I see exactly the same thing every time I follow the steps outlined above.

The systems I tried are Debian Linux and FreeBSD, all with the same result.

-- 
Michiel van Baak
mich...@vanbaak.eu
GPG key: http://pgp.mit.edu/pks/lookup?op=get&search=0x6FFC75A2679ED069
NB: I have a new GPG key. Old one revoked and revoked key updated on keyservers.
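[Editor's note: the rebuild steps above map onto the standard SKS commands roughly as follows. This is a hedged sketch: the dump path, parallelism, cache sizes, and service user are assumptions, not from the post, and the `run` wrapper only echoes each step so nothing executes as written.]

```shell
# Dry-run sketch of the rebuild recipe; each step is printed, not executed.
# Dump location and cache sizes are assumptions.
run() { echo "+ $*"; }

run systemctl stop sks                            # stop the node
run rm -rf /var/lib/sks/DB /var/lib/sks/PTree     # destroy both datasets
run sks build /srv/dump/*.pgp -n 10 -cache 100    # load the fresh dump into the key DB
run sks pbuild -cache 20 -ptree_cache 70          # create the ptree database
run chown -R debian-sks: /var/lib/sks             # Debian's sks package user
run systemctl start sks                           # start again, then check /pks/lookup?op=stats
```

Remove the `echo` in `run` (or replace `run` with nothing) to execute the steps for real; `sks build` and `sks pbuild` must be run from the SKS base directory with the daemon stopped.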
Re: [Sks-devel] SKS Performance oddity
On Sat, 2019-03-09 at 00:22 -0500, Jeremy T. Bouse wrote:
> I've been fighting with this for several days now... Anyone else
> out there seeing this behavior or if not and have similar resourced
> servers care to share details to see if I'm missing something here.
>
> The particulars are that all nodes are Debian 9.8 (Stretch) 64-bit.

I'm running a near-identical setup of SKS (but a single instance) and I too am seeing the same behaviour. A fresh import from Matt's daily data yields a nice experience. Once the service is active and online, the data quickly (within hours) becomes corrupted with bogus, invalid, unsanitized uid data (and that's putting it nicely; it's not really uid data).

-Jim P.
Re: [Sks-devel] SKS Performance oddity
On 3/9/2019 5:29 AM, Michiel van Baak wrote:
> Hey,
>
> I have exactly the same problem.
> Several times in the last month I have done the following steps:
> [...]

I've been trying to narrow it down and zero in on something to fix, though I admittedly don't know that much about the internal workings of the process flow. I have noticed that the issue is not the recon service itself, despite it appearing so blatantly during recon mode; from my observation it actually appears to be the DB service.

At this point I have 5 nodes: sks01 - sks04 are my original 4 VM nodes, all with 2 vCPU/4GB except sks01 which is 4 vCPU/8GB, and then sks0, which is my physical server with a 4-core Xeon and 4GB RAM. Currently sks0 is set up to be my external peering point; originally it was sks01. I have just finished re-importing the keydump into sks0 and sks01 from the daily dumps from mattrude.com for 2019-03-08 and 2019-03-09 respectively.
I'm running the following command from another machine to check on things:

> for I in $(seq 50 54); do echo .${I}; ssh 172.16.20.${I} 'uptime; ps aux | grep sks | grep -v grep; time curl -sf localhost:11371/pks/lookup?op=stats | grep keys:'; echo; done

.50
 18:14:26 up 1 day, 11:30, 7 users, load average: 0.10, 0.69, 1.31
debian-+ 24595 17.5 13.5  605012  540968 ?  Ss  15:32  28:32 /usr/sbin/sks -stdoutlog db
debian-+ 24596  0.3  0.8   72528   32740 ?  Ss  15:32   0:37 /usr/sbin/sks -stdoutlog recon
Statistics  Total number of keys: 5448526
real  0m0.014s
user  0m0.004s
sys   0m0.004s

.51
 18:14:28 up 1 day, 14:03, 4 users, load average: 1.30, 1.65, 1.49
debian-+  5166 32.4 36.0 3059044 2950716 ?  Ss  15:37  51:01 /usr/sbin/sks -stdoutlog db
debian-+  5167  0.5  4.0  603644  331260 ?  Ss  15:37   0:48 /usr/sbin/sks -stdoutlog recon
Statistics  Total number of keys: 5448005
real  0m0.022s
user  0m0.012s
sys   0m0.000s

.52
 18:14:30 up 7 days, 19:21, 4 users, load average: 0.98, 0.38, 0.31
debian-+  6234  0.5 38.6 1609044 1565612 ?  Rs  Mar06  30:33 /usr/sbin/sks -stdoutlog db
debian-+  6235  0.0  3.8  356328  156708 ?  Ss  Mar06   0:51 /usr/sbin/sks -stdoutlog recon
Statistics  Total number of keys: 5447149
real  1m46.269s
user  0m0.012s
sys   0m0.000s

.53
 18:16:17 up 7 days, 19:28, 4 users, load average: 2.01, 1.55, 0.85
debian-+  5754  0.6 13.6  590840  551360 ?  Ds  Mar05  37:20 /usr/sbin/sks -stdoutlog db
debian-+  5755  0.0  3.1  266908  126064 ?  Ss  Mar05   1:59 /usr/sbin/sks -stdoutlog recon
Statistics  Total number of keys: 5447523
real  0m46.400s
user  0m0.008s
sys   0m0.004s

.54
 18:17:05 up 7 days, 19:28, 4 users, load average: 1.88, 0.87, 0.41
debian-+  5994  0.6 18.5  791456  752596 ?  Ss  Mar05  35:24 /usr/sbin/sks -stdoutlog db
debian-+  5995  0.0  3.0  260224  122112 ?
Ds  Mar05   1:45 /usr/sbin/sks -stdoutlog recon
Statistics  Total number of keys: 5447788
real  0m0.015s
user  0m0.008s
sys   0m0.000s

For stability's sake I'd removed sks0 and sks01 from my NGINX upstreams. The exception is that I have

location /pks/hashquery {
    proxy_method POST;
    proxy_pass http://127.0.0.1:11371;
}

so that /pks/hashquery doesn't use the server pool but uses the local SKS instance. So sks0 is only seeing all traffic to 11370/tcp, plus traffic for the /pks/hashquery URI on 11371/tcp; all other /pks URI requests are going to the backend and hitting sks02 - sks04.

I have found some improvement by changing the *pagesize settings before re-importing the keydump. Currently all my nodes have had their data re-imported using the following settings:

pagesize: 128
keyid_pagesize: 64
meta_pagesize: 1
subkeyid_pagesize: 128
time_pagesize: 128
tqueue_pagesize: 1
ptree_pagesize: 8

I also have the hack to short-circuit the bad actor keys that had been mentioned on the list using: if ( $arg_search ~* "(0x1013