Hi Daryl, Yeah that long bit of info was brewing for a while, and not all related to osa-dispatcher. :) There are a few somewhat obscure settings that aren't in the config file directly (like the ability to tune taskomatic via taskomatic.errata_cache_max_work_items and taskomatic.channel_repodata_max_work_items, for example) that were only found from hours of scouring various posts and articles.
The line for osa-dispatcher.debug should look like (except for the # and after it, not sure how much you've dabbled in coding/scripting): osa-dispatcher.debug = 2 # (on a scale from 0-9, I guess, with verbosity increasing as the number also increases). Once that line is updated, and osa-dispatcher restarted, you should see additional entries in /var/log/rhn/osa-dispatcher.log like so: 2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/jabber_lib._get_jabber_client('Connecting to', 'yourmasterfqdn') 2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/jabber_lib.__init__ 2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/jabber_lib.connect 2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', 'yourmasterfqdn') 2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/dispatcher_client.start 2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/jabber_lib.auth('rhn-dispatcher-sat', 'VinkOxDq35oU8tJqK8L43T5ZRldygDEs', 'superclient', 1) 2016/08/18 11:06:05 -05:00 25096 0.0.0.0: osad/osa_dispatcher.fix_connection('Upstream notification server started on port', 1290) 2016/08/18 11:06:06 -05:00 25096 0.0.0.0: osad/jabber_lib.process_forever 2016/08/18 22:54:51 -05:00 25096 0.0.0.0: osad/dispatcher_client._message_callback('Ping response',) In the OSA dispatcher code, there are some debug lines as such, with the first argument in the log_debug function corresponding to the setting above: log_debug(1, "Upstream notification server started on port", port) log_debug(4, "Subscribed to", c._roster.get_subscribed_to()) log_debug(4, "Subscribed from", c._roster.get_subscribed_from()) log_debug(4, "Subscribed both", c._roster.get_subscribed_both()) if client in rfds: log_debug(5, "before process") client.process(timeout=None) log_debug(5, "after process") Good luck! On Mon, Aug 22, 2016 at 10:57 AM Daryl Rose <darylr...@outlook.com> wrote: > Matt, > > > Thank you very much for this information. There was a lot here to > process, so I had to read it multiple time to make sure that I understood > what you were doing. > > > One of the errors that I saw in the osa-dispatcher log was an invalid > password. I ran the sql command as suggested and saw that there were > in fact two entries for the rhn-dispatcher. One from the original > install a year ago, and then a second one with the FQDN. I removed both > entries, cleaned out the databases and restarted jabberd/osa-dispatcher. > I'll keep an eye on things for a while and see if this helps with my > database issues. > > > Oh, almost forgot......I do not see an entry in /etc/rhn/rhn.conf for > osa_dispatcher.debug. I added it in and stop/started osa-dispatcher, but I > did not see any additional logging. > > > Thanks > > > Daryl > > ------------------------------ > *From:* spacewalk-list-boun...@redhat.com < > spacewalk-list-boun...@redhat.com> on behalf of Matt Moldvan < > m...@moldvan.com> > *Sent:* Friday, August 19, 2016 9:57 AM > *To:* spacewalk-list@redhat.com > > *Subject:* Re: [Spacewalk-list] Ongoing jabberd/osad issues. > To answer your question about how you can tell if clients are registering > (to OSA dispatcher, not jabber), set osa_dispatcher.debug = 4 (goes up to 9 > but that is -too much- info in my experience) in /etc/rhn/rhn.conf and > restart osa-dispatcher. This will tell you if systems are subscribing to > the dispatcher's presence and are ready to receive actions. > > 2016/07/28 08:16:28 -05:00 3187 0.0.0.0: > osad/jabber_lib._roster_callback('Updating the roster', <iq > to='rhn-dispatcher-sat@somefqdn/superclient' type='set' > id='d0i9bb6qg2mxcvod2204d697ar81obq2b9cmljd1'><query xmlns = > 'jabber:iq:roster' ><item jid='osad-5987ca54b5@somefqdn' > subscription='both' /></query></iq>) > 2016/07/28 08:16:28 -05:00 3187 0.0.0.0: > osad/jabber_lib._presence_callback('rhn-dispatcher-sat@somefqdn/superclient', > osad-5987ca54b5@somefqdn, u'subscribed') > > For jabber connections, check /var/log/messages for jabberd entries via > syslog. I had to change rsyslog settings to see all of them, as there were > so many coming in when I'd restart the jabber services that rsyslog would > start to rate-limit them. > > Also, osad should -not- be installed on the masters, it conflicts with the > OSA dispatcher packages... > > If you have to do it manually, do this. > > In the Spacewalk back end database (execute "sudo spacewalk-sql -i" or as > root), run select * from rhnpushdispatcher to see what OSA dispatcher > thinks the password is. Stop jabberd, delete the entries in that table, > delete the Berkley DB files, then start jabberd and osa-dispatcher. I have > more about the issue below, but it's a very long and boring story. > > --- long story > > This is a long story... excuse the rant but it's caused me a lot of late > nights over the course of implementation. This is a sort of catharsis to > be able to send. > > So, I've spent the last year and a half trying various methods to keep OSA > dispatcher on our Spacewalk masters stable, and trying to hammer the square > peg that is jabberd2 into our round hole of using F5s for GTMs and LTMs. > Generally, for example for LDAP services, we create an SSL cert signed by > our internal CA and include the GTM FQDN as the subject, and any LTM FQDN > and pool members as the Subject Alternate Names. But, the question wasn't > really about that, just wanted to share my frustration that jabberd2 (s2s > specifically) doesn't play nicely when a system has the GTM FQDN in the > jabberd config (routes get marked as invalid and so on). Also, the default > implementation in Spacewalk uses the Berkley DB, which is notorious for > crashing and corrupting itself and also doesn't support locking, so good > luck trying to put it on an NFS mount and having multiple proxies writing > to that default database. > > So instead I tried Postgres in the jabber config, which worked well for a > time, but pointing all of our clients to the GTM FQDN and attempting to > schedule actions in the GUI wasn't working out. When you're in an > environment that follows ITIL processes and are doing production systems in > always-varying scheduled and approved change windows, having rhnsd pick up > actions whenever it feels like it (I've seen 24 hours+) just doesn't work > for us. > > I had to disable snapshots, because for when spacewalk-clone-by-date runs > (daily) and also when systems check in with Puppet (ensuring registration > daily), Spacewalk sees that the base channel for thousands of systems has > been updated, and attempts to update these sequentially, locks the table, > and causes memory exhaustion on the master and the database server. > > This week, I gave up on using the proxies for Jabber and pointed all my > clients to the masters in their respective datacenters. This brought up an > interesting issue, in the OSA dispatcher Python code, an attempt is made to > retrieve the password from the rhnPushDispatcher table in the Spacewalk > database. Unfortunately, it only uses part of the Jabber ID, which is hard > coded to "rhn-dispatcher-sat" in the code. So, if you have two, it will > most likely not pull the correct one and forever grab the wrong password > and attempt to use that against Jabber, and then crash OSA dispatcher. > > I personally have two masters (and two OSA dispatchers) because I want > datacenter redundancy in case of an issue in one. Also, I want to use the > GUI via GTM in global availability mode again, for redundancy. So, having > OSA dispatcher running on both is very preferable in my situation. > Unfortunately with the issue I mentioned above about the dispatcher code > not pulling the password properly from rhnPushDispatcher when there are > multiple entries, I had to implement an ugly hack. > > Another issue to throw in here was that jabber and OSA dispatcher would > randomly crash. SM and C2S would segfault at random times and OSA > dispatcher would error out for many various reasons: > - timeouts when it was loading the rows from "active" and "roster-items" > - invalidclient errors (didn't look into these much but maybe old > versions of OSAD components on the clients). ridiculous that one bad > client could cause the whole dispatcher process to bomb out though > - SELinux errors that weren't covered by running > osa-dispatcher-selinux-enable (had to run audit2allow and semodule more > times than I'm proud of, a few times in a while loop because fixing one > SELinux error would lead to another being denied via SELinux in the next > line of the code) > > Below is the script that I mentioned to run the hack to get around the osa > dispatcher laziness of using a "like" statement to pull the password from > rhnPushDispatcher. It's ugly but the dispatcher daemon has stayed up for > almost 24 hours now (I consider this a win after all my other issues and my > year and half long struggle with the software) and we have ~5.5k systems > online with it. It's in cron running every 2 minutes to check on both > jabberd and osa-dispatcher services. > > Note that the SQL file is after changing the jabber config on the masters > to use SQLite (another attempt in the long list of things I tried to keep > things stable). A similar method would most likely work but the > fixjabber.sql file below would need updating. > > I don't like the idea of deleting the entire Jabber database every time, I > think it's a cop out and requires all systems to recreate entries in > multiple tables of that database, whether it's Berkley DB or Postgres or > whatever. > > --- > > [me@ourmaster1 ~]$ cat /var/log/fixjabber.sh.out > osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 > 11:27:59 CDT 2016 > [me@ourmaster2 ~]$ cat /var/log/fixjabber.sh.out > osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 > 11:27:01 CDT 2016 > > [me@ourmaster1 ~]$ sudo cat /usr/local/bin/fixjabber.sh > #!/bin/bash > PATH=/sbin:/usr/bin:/bin > LOGFILE=/var/log/$(basename $0).out > if [ ! -f "${LOGFILE}" ]; then > echo "osa-dispatcher and jabberd have been restarted 0 times since > $(date)" >> "${LOGFILE}" > fi > service jabberd status && service osa-dispatcher status > if [ $? -ne 0 ]; then > service osa-dispatcher stop > service jabberd stop > echo "delete from rhnpushdispatcher where > jabber_id='rhn-dispatcher-sat@$(uname -n)/superclient';" | spacewalk-sql > -i > sqlite3 /var/lib/jabberd/db/sqlite.db < > /usr/local/etc/fixjabber.sql > service jabberd start > service osa-dispatcher start > oldnum=$(cut -d ' ' -f7 ${LOGFILE}) > newnum=$(expr $oldnum + 1) > sed -i "s/$oldnum/$newnum/g" ${LOGFILE} > fi > > [me@ourmaster1 ~]$ cat /usr/local/etc/fixjabber.sql > delete from authreg where username = 'rhn-dispatcher-sat'; > delete from "roster-items" where "collection-owner" = 'rhn-dispatcher-sat@ > ourmaster1.fqdn'; > delete from status where "collection-owner" = > 'rhn-dispatcher-...@ourmaster1.fqdn'; > delete from active where "collection-owner" = > 'rhn-dispatcher-...@ourmaster1.fqdn'; > > > > On Fri, Aug 19, 2016 at 8:34 AM Daryl Rose <darylr...@outlook.com> wrote: > >> Basically you're doing everything that I've been doing with the exception >> of the db_recover command. I was not familiar with that command. >> >> >> How can I tell if the clients are self registering or not? >> >> >> Thank you. >> >> >> Daryl >> >> >> ------------------------------ >> *From:* Robert Paschedag <robert.pasche...@web.de> >> *Sent:* Friday, August 19, 2016 3:42 AM >> *To:* Daryl Rose >> *Cc:* spacewalk-list@redhat.com >> *Subject:* Re: [Spacewalk-list] Ongoing jabberd/osad issues. >> >> Now I had also a problem with the database. Just wanted to check, which >> logfiles of the jabber db are not needed anymore as stated in >> >> http://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/transapp/logfile.html >> Berkeley DB Reference Guide: Log file removal >> <http://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/transapp/logfile.html> >> web.stanford.edu >> Log file removal. The fourth component of the infrastructure, log file >> removal, concerns the ongoing disk consumption of the database log files. >> >> >> >> Just running "db_archive" killed my jabber db. And....fixing it with >> >> db_recover -v or >> db_recover -c >> >> within /var/lib/jabber/db >> >> did not work. >> >> So I was also in the situation to "clean" the database. >> >> 1. Stop jabberd (/etc/init.d/jabberd stop) >> 2. Stop osa-dispatcher (/etc/init.d/osa-dispatcher stop) >> 3. Remove contents of /var/lib/jabber/db (rm -f /var/lib/jabber/db/*) >> 4. Start jabber (/etc/init.d/jabberd start) >> 5. Start osa-dispatcher (/etc/init.d/osa-dispatcher start) >> >> I thought, I should restart the osad client everywhere....but no....the >> clients are just re-registering themeselves automatically. Of course, I >> have to check this on every client but what I have checked so far is >> looking good. >> >> Regards >> Robert >> >> >> Am 18.08.2016 um 09:30 schrieb Robert Paschedag: >> > Hi Daryl, >> > >> > as long as there are no error messages within the logs that there seems >> to be an error with the jabber db, I wouldn't do anything with the db. >> > >> > As I earlier wrote, I only had to repair the db once within about 3 1/2 >> years. >> > >> > So, what I would do now is to really delete the jabber db (back it >> up... just in case) to start up with a"clean " install. If the clients >> (that already have authentication information) do not re-register >> automatically, you should go to the client, stop osad, remove >> /etc/sysconfig/rhn/osad-auth.conf and start osad again. The client should >> then register and you should see the status on the web GUI as "online". If >> not, check the /var/log/rhn/osad.log on the client (if I remember correct >> right now) and osa-dispatcher logs in the server. >> > >> > I also wrote, that my spacewalk servers are NOT clients of themselves. >> I don't think, that should be a problem but just for " testing " you should >> deactivate osad "client" on the spacewalk server. >> > >> > Start with one test server. >> > >> > Good luck. >> > >> > Regards >> > Robert >> > Am 17.08.2016 20:43 schrieb Daryl Rose <darylr...@outlook.com>: >> >> >> >> I've posted here issues that I've had with jabberd and osad, as have >> others. But I haven't gotten things resolved, so I am posting additional >> information. >> >> >> >> >> >> I put SW into production about a year ago. After a period of time, I >> noticed issues with the WUI and servers not reporting correctly and other >> issues. Google searches show that I need to shutdown spacewalk and remove >> all the contents in /var/lib/jabberd/db. This seemed to work, but after a >> few months, I realized that osad was no longer communicating with >> osa-dispatcher. >> >> >> >> >> >> I started doing some additional research and learned that was not a >> good way to resolve this issue. According to the official Spacewalk >> documentation, I should create a checkpoint and then clean up log files >> keeping the database and auth database files. >> >> >> >> >> >> https://fedorahosted.org/spacewalk/wiki/JabberDatabase >> >> >> >> JabberDatabase – spacewalk - Fedora Hosted >> >> fedorahosted.org >> >> Jabber Database. Spacewalk utilizes Jabber to facilitate >> communications between the server and the clients for osa-dispatcher/osad. >> The Jabber program uses the ... >> >> These are the steps that I followed: >> >> >> >> >> >> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/ ## mark logs for >> deletion >> >> /usr/bin/db_archive -d -h /var/lib/jabberd/db/ ## delete logs >> >> service jabberd restart >> >> >> >> However, this also causes problems with jabberd and osad. If I use >> the commands as the documentation instructs, then osa-dispatcher will >> start, but die, and I get errors in the log that there is an invalid >> password. >> >> >> >> >> >> So to help explain my issue, I ran a test and tried to capture >> everything that I could and I'll post it here. >> >> >> >> >> >> 1. Listing of /var/lib/jabberd/db >> >> >> >> [root@<spwalk-server> db]# ls >> >> __db.001 __db.006 log.0000000004 log.0000000009 >> log.0000000014 log.0000000019 log.0000000024 sm.db >> >> __db.002 authreg.db log.0000000005 log.0000000010 >> log.0000000015 log.0000000020 log.0000000025 >> >> __db.003 log.0000000001 log.0000000006 log.0000000011 >> log.0000000016 log.0000000021 log.0000000026 >> >> __db.004 log.0000000002 log.0000000007 log.0000000012 >> log.0000000017 log.0000000022 log.0000000027 >> >> __db.005 log.0000000003 log.0000000008 log.0000000013 >> log.0000000018 log.0000000023 log.0000000028 >> >> >> >> 2. Spacewalk Server Status >> >> >> >> [root@<spwalk-server> db]# spacewalk-service status >> >> postmaster (pid 1175) is running... >> >> router (pid 21431) is running... >> >> sm (pid 21441) is running... >> >> c2s (pid 21451) is running... >> >> s2s (pid 21461) is running... >> >> tomcat6 (pid 1304) is running... [ OK ] >> >> httpd (pid 1385) is running... >> >> osa-dispatcher (pid 21479) is running... >> >> rhn-search is running (1441). >> >> cobblerd (pid 1491) is running... >> >> RHN Taskomatic is running (1515). >> >> >> >> 3. Most recent log file entry: >> >> >> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__ >> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: >> osad/jabber_lib.setup_connection('Connected to jabber server', >> '<spwalk-server>.com') >> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: >> osad/osa_dispatcher.fix_connection('Upstream notification server started on >> port', 1290) >> >> 2016/08/17 07:44:14 -05:00 21476 0.0.0.0: >> osad/jabber_lib.process_forever >> >> >> >> 4. Ran the commands as instructed in the jabberd documentation. >> >> >> >> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/ ## mark logs for >> deletion >> >> /usr/bin/db_archive -d -h /var/lib/jabberd/db/ ## delete logs >> >> service jabberd restart >> >> >> >> 5. Log file entry: >> >> >> >> 2016/08/17 13:28:19 -05:00 21476 0.0.0.0: >> osad/jabber_lib.main('ERROR', 'Traceback (most recent call last):\n File >> "/usr/share/rhn/osad/jabber_lib.py", line 121, in main\n >> self.process_forever(c)\n File "/usr/share/rhn/osad/jabber_lib.py", line >> 179, in process_forever\n self.process_once(client)\n File >> "/usr/share/rhn/osad/osa_dispatcher.py", line 187, in process_once\n >> client.retrieve_roster()\n File "/usr/share/rhn/osad/jabber_lib.py", line >> 729, in retrieve_roster\n stanza = self.get_one_stanza()\n File >> "/usr/share/rhn/osad/jabber_lib.py", line 801, in get_one_stanza\n >> self.process(timeout=tm)\n File "/usr/share/rhn/osad/jabber_lib.py", line >> 1055, in process\n data = self._read(self.BLOCK_SIZE)\nSSLError: >> (\'OpenSSL error; will retry\', "(-1, \'Unexpected EOF\')")\n') >> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__ >> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: >> osad/jabber_lib.setup_connection('Connected to jabber server', >> '<spwalk-server>.com') >> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: >> osad/jabber_lib.register('ERROR', 'Invalid password') >> >> >> >> 6. Spacewalk server status >> >> >> >> [root@<spwalk-server> db]# spacewalk-service status >> >> postmaster (pid 1175) is running... >> >> router (pid 27119) is running... >> >> sm (pid 27129) is running... >> >> c2s (pid 27139) is running... >> >> s2s (pid 27149) is running... >> >> tomcat6 (pid 1304) is running... [ OK ] >> >> httpd (pid 1385) is running... >> >> osa-dispatcher dead but pid file exists >> >> rhn-search is running (1441). >> >> cobblerd (pid 1491) is running... >> >> RHN Taskomatic is running (1515). >> >> >> >> 7. Long listing of /var/lib/jabberd/db >> >> >> >> [root@<spwalk-server> db]# ls -l >> >> total 7536 >> >> -rw-r-----. 1 jabber jabber 24576 Aug 17 13:28 __db.001 >> >> -rw-r-----. 1 jabber jabber 204800 Aug 17 13:29 __db.002 >> >> -rw-r-----. 1 jabber jabber 270336 Aug 17 13:29 __db.003 >> >> -rw-r-----. 1 jabber jabber 98304 Aug 17 13:29 __db.004 >> >> -rw-r-----. 1 jabber jabber 753664 Aug 17 13:29 __db.005 >> >> -rw-r-----. 1 jabber jabber 57344 Aug 17 13:29 __db.006 >> >> -rw-r-----. 1 jabber jabber 368640 Aug 17 07:46 authreg.db >> >> -rw-r-----. 1 jabber jabber 10485760 Aug 17 13:29 log.0000000031 >> >> -rw-r-----. 1 jabber jabber 487424 Aug 17 13:29 sm.db >> >> >> >> So, neither completely cleaning out jabberd database/log files works, >> and creating a checkpoint and removing log files that need to be cleaned >> out doesn't' work, so what can I do to get jabberd and osad to work, and to >> push out updates when I need to push them out? >> >> >> >> >> >> Thank you. >> >> >> >> >> >> Daryl >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > >> > _______________________________________________ >> > Spacewalk-list mailing list >> > Spacewalk-list@redhat.com >> > https://www.redhat.com/mailman/listinfo/spacewalk-list >> > >> _______________________________________________ >> Spacewalk-list mailing list >> Spacewalk-list@redhat.com >> https://www.redhat.com/mailman/listinfo/spacewalk-list > > _______________________________________________ > Spacewalk-list mailing list > Spacewalk-list@redhat.com > https://www.redhat.com/mailman/listinfo/spacewalk-list
_______________________________________________ Spacewalk-list mailing list Spacewalk-list@redhat.com https://www.redhat.com/mailman/listinfo/spacewalk-list