(Following up to my own OP to complete the saga after much help from Ryan 
Senior. I'm replying here instead of at the end of the other thread branch so 
this will be easier to find in later searches. Warning: long and often 
ignorant.)

The problem might have been caused by factors external to puppetdb (a kill -9 
from our chaos monkey). This isn't well supported by evidence, but it's better 
than most other notions.

My solution was ultimately to drop the puppetdb database and recreate it (the 
puppetdb host is puppetized). I was fortunate in that I didn't care about the 
data beyond how the latest agent run went, and that would be repopulated soon 
enough. The procedure:

a) on the puppetdb host

/etc/init.d/puppetdb stop   # postgres won't drop a db with open connections
su - postgres
psql
drop database puppetdb;

b) on the puppetmaster host, scrub the puppetdb integration so that the agent 
run won't choke

unalias rm                    # in case rm is aliased to rm -i
rm -f /etc/monit.d/puppet*    # keep monit from restarting what we stop next
monit reload
/etc/init.d/puppet stop
/etc/init.d/httpd stop        # the master runs under httpd here
rm /etc/puppet/puppetdb.conf
rm /etc/puppet/routes.yaml

From /etc/puppet/puppetmaster.conf (I use this in preference to puppet.conf 
for everything), remove the following settings from the [master] section of 
the file:

storeconfigs
storeconfigs_backend
reports
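
For reference, those settings as originally present would have looked 
something like this (values per the standard puppetdb terminus setup, so 
treat them as assumptions about my config):

[master]
storeconfigs = true
storeconfigs_backend = puppetdb
reports = store,puppetdb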

/etc/init.d/httpd start

c) on the puppetdb host, let puppet enforce the config, including the postgres 
createdb etc.

/etc/init.d/puppet restart
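
If your puppetdb database isn't puppetized and you need the manual 
equivalent, the puppetdb docs' postgres setup is roughly this (role name and 
options are assumptions based on a stock install):

su - postgres
createuser -DRSP puppetdb               # only if the role is gone too
createdb -E UTF8 -O puppetdb puppetdb   # recreate the db owned by that role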

d) on the puppetmaster host, put things back how they should be

/etc/init.d/puppet restart
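
That works because the master is puppetized too: the run restores 
/etc/puppet/puppetdb.conf, /etc/puppet/routes.yaml and the three settings 
above. If you were putting them back by hand, the standard contents are 
roughly (the puppetdb hostname here is an assumption):

# /etc/puppet/puppetdb.conf
[main]
server = puppetdb.me.com
port = 8081

# /etc/puppet/routes.yaml
---
master:
  facts:
    terminus: puppetdb
    cache: yaml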


We didn't find direct evidence of what happened, but we came close. If this 
happens to me again, here's what I'd use to investigate:

The json in /var/lib/puppetdb/mq/discarded/replace-catalog has the name of the 
server causing the issue. For me this was serialized near the end of the json 
line, just before the transaction uuid. In my case every failure came from the 
same single host, which narrowed things down.

"name":"server.me.com","transaction-uuid":"b97fa362-d7ed-499b-a3ac-3c558583f396"

https://docs.puppetlabs.com/puppetdb/2.2/api/wire_format/catalog_format_v5.html
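
A quick way to pull those two fields out of everything in the discarded 
directory (the only assumptions here are GNU grep and the file layout, so 
adjust the glob to taste):

grep -ho '"name":"[^"]*","transaction-uuid":"[^"]*"' \
    /var/lib/puppetdb/mq/discarded/replace-catalog/* | sort -u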

Checking the database schema, it looked like puppetdb was trying to insert a 
resource hash + name pair that already existed, violating the resource_params 
primary key.

puppetdb=# \d resource_params;
        Table "public.resource_params"
  Column  |         Type          | Modifiers 
----------+-----------------------+-----------
 resource | character varying(40) | not null
 name     | text                  | not null
 value    | text                  | not null
Indexes:
    "resource_params_pkey" PRIMARY KEY, btree (resource, name)
    "idx_resources_params_name" btree (name)
    "idx_resources_params_resource" btree (resource)
Foreign-key constraints:
    "resource_params_resource_fkey" FOREIGN KEY (resource) REFERENCES 
resource_params_cache(resource) ON DELETE CASCADE
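
Given a resource hash from a failing insert, you can look at what's already 
stored under it (the hash below is a placeholder):

puppetdb=# select name, value from resource_params
           where resource = '<hash from the failed insert>';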

I cranked up logging by modifying /etc/puppetdb/logback.xml to change 
com.puppetlabs.puppetdb.scf.storage to "debug", which wasn't much more 
informative.

https://docs.puppetlabs.com/puppetdb/latest/configure.html#the-logback-logging-config-file
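
The change itself is one logger element inside the existing <configuration> 
block:

<logger name="com.puppetlabs.puppetdb.scf.storage" level="debug"/>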

If I had been able to squeeze more time out of my day I might have tried 
hashing all the resources in a catalog myself to see which one was causing the 
failure, but I really had to get back to things.
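
For the record, a very rough sketch of that idea. This is not puppetdb's 
actual hashing (it canonicalizes resources internally before SHA-1ing them); 
it just serializes each resource from a discarded catalog and flags any two 
that come out identical:

# dupecheck.py - illustrative only, not puppetdb's real hash
import hashlib, json, sys

doc = json.load(open(sys.argv[1]))
# unwrap command/older wire-format envelopes if present
for key in ("payload", "data"):
    if isinstance(doc, dict) and key in doc:
        doc = doc[key]

seen = {}
for r in doc["resources"]:
    h = hashlib.sha1(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
    if h in seen:
        print("identical serialization: %s %s[%s]" % (h, r["type"], r["title"]))
    seen[h] = r

Run it against one of the discarded files:

python dupecheck.py /var/lib/puppetdb/mq/discarded/replace-catalog/<file>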

This puppetdb host was previously running 1.6.3 alongside a random kill -9 
sort of chaos monkey, so it's certainly possible that a puppetdb catalog 
replacement was killed halfway through and left stale data behind. I've since 
stopped the chaos monkey from running on our puppet infrastructure hosts; we 
were long past needing proof that services would be brought back up after 
dying.

I thought that 'puppet node deactivate server.me.com' on the puppetmaster 
might help, but it didn't. It sounded like deactivation would disassociate the 
resources from the data about the hostname, but it's not like I understand 
what's going on under the hood.

https://docs.puppetlabs.com/puppetdb/latest/maintain_and_tune.html
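
You can also check from the query API whether puppetdb considers a node 
deactivated; a deactivated node shows a timestamp in its "deactivated" field 
(v4 API per the 2.x docs, default cleartext port, hostname assumed):

curl -s 'http://localhost:8080/v4/nodes/server.me.com'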

Per Ryan Senior this was the first time they'd seen this sort of thing, so I'm 
favouring a theory where an interaction between puppetdb's various SQL 
commits, the chaos monkey, and some resource or another went a bit wrong and 
left stale data behind.

On Tue, Nov 11, 2014 at 09:13:21AM -0500, Christopher Wood wrote:
> Do any of you have troubleshooting tips for the puppetdb error log message 
> about violating database constraints? How would I tell which constraint and 
> which resource, short of logging all sql queries?
> 
> 2014-11-11 08:29:34,814 ERROR [c.p.p.command] 
> [6133d0c4-3350-4567-abb0-476980a9616e] [replace catalog] Retrying after 
> attempt 4, due to: org.postgresql.util.PSQLException: ERROR: duplicate key 
> value violates unique constraint "resource_params_pkey"
> 
> The full error: https://gist.github.com/anonymous/3a169a2b9380196983c6
> 
> 
> More details...
> 
> Per our logs, we've been seeing this with puppetdb 1.6.3 and 2.2.2.
> 
> So far I've tried, with little success:
> 
> -raising the storage log level to debug
> -digging through /var/log/puppetdb/puppetdb.log before/after debug; couldn't 
> see a pattern
> -stop puppetdb, purge files under /var/lib/puppetdb/mq, start puppetdb; to 
> start clean and see if maybe something was corrupt/cached, but didn't help
> 
> The kahadb page:
> 
> https://docs.puppetlabs.com/puppetdb/latest/trouble_kahadb_corruption.html
> 
> We rebuild preprod machines at intervals and we do use some exported 
> resources but I'd like to know what's happening before I start firing off 
> "puppet node deactivate randomhost" until the problem fixes itself.
> 
> https://ask.puppetlabs.com/question/88/how-can-i-purge-exported-resources-from-puppetdb/
> 
> Logging all sql queries:
> 
> http://stackoverflow.com/questions/722221/how-to-log-postgres-sql-queries
> 
> The stack:
> 
> java 1.7.0 openjdk
> postgresql 8.4 (it came with CentOS)
> puppet 3.7.2 puppetmasters and on puppetdb host, but 3.6.2 on most nodes
> puppetdb 2.2.2
> puppetdb-terminus 2.2.2 on all puppetmasters
> 
