[Pdns-users] Cache Problems with upgrade to Recursor 3.3

2010-12-01 Thread Jeremy Utley

Good afternoon,

We've been working on upgrading our recursors from 
pdns-recursor-3.1.7.1-1 to pdns-recursor-3.3-1, and have seen some 
oddities I wanted to ask the list about.  First, a basic rundown of our 
environment:


Our existing production servers are running pdns-recursor-3.1.7.1-1 
installed via RPMs downloaded from your website.  The recursor itself is 
run within a Xen PV virtual machine on a CentOS 5.5 base.  To ensure we 
utilize all 4 cores of the processors in those machines, 2 instances of 
the recursor are launched simultaneously, listening on different IP 
addresses, and we utilize the fork option.  We have a total of 6 
machines configured this way, behind a Foundry load balancer which 
handles sharing the load between them.  This implementation has been in 
place for about a year with no issues.  We also use Cacti graphs for 
collecting performance data, by extending SNMP with output from the 
rec_control command.


The new test server is pdns-recursor-3.3-1 installed via RPM downloaded 
from your website, and also running within a Xen PV virtual machine on a 
CentOS 5.5 base.  Rather than launching multiple instances, we are 
launching 4 recursor threads (machines have 4 CPU cores).  Most other 
settings are configured identically between old and new servers.  This 
test server was added to the load balancer on Monday afternoon, taking a 
fraction of the traffic that would have gone to the 6 old machines.


The problem I'm seeing is that caching does not seem to be working 
properly, which is causing a performance hit.  To document this effect, 
the following graph images were taken a little while ago from our Cacti 
installation:


http://www.jutley.org/DNS

Looking at the 4th graph down, which is the cache statistics on the old 
version recursor, you will see that around 90% of all questions are 
cache hits, with around 10% as cache misses.  And, looking at the third 
graph (showing how fast queries are answered), you'll see that over 90% 
of all queries are answered in less than 1 ms.


However, looking at the bottom graph, which is the cache statistics on 
the new recursor, the statistics are totally different.  Only 1.1% of 
the total questions are cache hits, while 6.8% are cache misses, which 
to me makes no sense, since a question *HAS* to be either a cache hit or 
cache miss.  And, looking at the 7th graph (answer speed on the new 
recursor version), most queries are taking more than 10ms to answer.


Just as additional info, the data collected by cacti to generate these 
graphs comes from the following command:


/usr/bin/rec_control get questions cache-entries cache-hits cache-misses 
concurrent-queries resource-limits unauthorized-tcp unauthorized-udp 
spoof-prevents answers-slow client-parse-errors answers0-1 answers1-10 
answers10-100 answers100-1000 qa-latency
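
As a cross-check on the percentages above, here is a minimal Python sketch that turns rec_control counters into shares of total questions. The counter values are placeholders chosen to mirror the 1.1% / 6.8% split, not real measurements, and the function name is hypothetical. A large unaccounted share would suggest that some questions are answered without touching the counters being graphed (for example by a separate packet cache), but that is an assumption to verify against the recursor's own statistics.

```python
def cache_shares(questions, cache_hits, cache_misses):
    """Return hit, miss and unaccounted shares of total questions, in percent."""
    if questions <= 0:
        raise ValueError("questions must be positive")
    hit = 100.0 * cache_hits / questions
    miss = 100.0 * cache_misses / questions
    return {"hit": hit, "miss": miss, "unaccounted": 100.0 - hit - miss}


# Placeholder counters (assumed values, not taken from the real graphs).
shares = cache_shares(questions=100000, cache_hits=1100, cache_misses=6800)
for name, value in shares.items():
    print(f"{name}: {value:.1f}%")
```

If the unaccounted share is close to zero on the old version and large on the new one, the two versions are probably counting different things rather than caching differently.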


Am I misinterpreting this, or is there definitely something going on?

Thanks for your time,

Jeremy
___
Pdns-users mailing list
Pdns-users@mailman.powerdns.com
http://mailman.powerdns.com/mailman/listinfo/pdns-users


Re: [Pdns-users] database backends without replication?

2010-12-02 Thread Jeremy Utley

On 12/2/2010 12:42 PM, Mark Felder wrote:
> If I use a database backend on each side without database replication,
> can I use an AXFR to have it automatically add a domain to the
> database of the slave, or is this still an issue with AXFR as a whole?
>
> Simply put, I don't want to have to touch the slaves when a new domain
> is added to the master. How does this work without database replication?

Read the PowerDNS documentation on "Supermaster" setups.  Basically, you 
designate your master DNS server as a supermaster, and when the slaves 
receive the announcement from the master, they automatically add that 
domain locally and initiate a transfer.  Removing domains would still 
require manual intervention, however.


Jeremy


[Pdns-users] PowerDNS and DomainKeys Oddities

2010-12-29 Thread Jeremy Utley

Hello everyone!

Today I started adding some DomainKeys entries into our PowerDNS server, 
and I encountered some oddities.  I did some research on the net 
regarding this issue, and didn't find anything that really helped me.  
First, our architecture:


Master Server:  CentOS 5, PowerDNS 2.9.22-1 static RPM downloaded from 
the PowerDNS website.  Also running PowerAdmin 2.1.4 from 
www.poweradmin.org as a web-based interface.  Using the MySQL backend.


Slaves:  6 of them, also running CentOS 5 and PowerDNS 2.9.22-1, with 
replication handled by MySQL instead of AXFR.


The problem:

When I make DomainKeys entries, I'm getting backslashes in my output 
from dig, but when looking at either the PowerAdmin interface or 
directly at the MySQL data, I'm not seeing them.  See below:


Dig output:
;; ANSWER SECTION:
_domainkey.domain.com. 41411  IN  TXT "t=y\; o=-\;"

MySQL output:
mysql> select * from records where name='_domainkey.domain.com';
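+--------+-----------+-----------------------+------+-------------+-------+------+-------------+
| id     | domain_id | name                  | type | content     | ttl   | prio | change_date |
+--------+-----------+-----------------------+------+-------------+-------+------+-------------+
| 294898 |      1168 | _domainkey.domain.com | TXT  | "t=y; o=-;" | 43200 |    0 |  1293645521 |
+--------+-----------+-----------------------+------+-------------+-------+------+-------------+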
1 row in set (0.00 sec)

I'm at my wits' end.  I can't see where PowerAdmin could be adding the 
backslashes, since the data in MySQL appears correct, but PowerDNS is 
definitely sending out the backslash escape.
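
One possibility worth ruling out is that the backslashes are only part of how dig prints the record. In zone-file presentation format a semicolon starts a comment, so tools commonly print a literal semicolon inside TXT data as \; on output. The Python sketch below, with hypothetical function names, shows that escaping and its reversal. It illustrates the convention only and makes no claim about what PowerDNS actually puts on the wire.

```python
def to_presentation(txt):
    """Escape characters that are special in zone-file presentation format."""
    out = []
    for ch in txt:
        if ch in ('"', "\\", ";"):
            out.append("\\")
        out.append(ch)
    return "".join(out)


def from_presentation(txt):
    """Undo single-character backslash escapes (decimal escapes are not handled)."""
    out = []
    chars = iter(txt)
    for ch in chars:
        if ch == "\\":
            ch = next(chars, "")  # keep the escaped character, drop the backslash
        out.append(ch)
    return "".join(out)


stored = "t=y; o=-;"                       # the content as it appears in MySQL
shown = to_presentation(stored)            # the form a display tool might print
print(shown)
print(from_presentation(shown) == stored)
```

If the unescaped form matches the database content, the stored record is probably fine and the escape is a display artifact; a packet capture would confirm it.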


Anyone have any suggestions?

Jeremy


[Pdns-users] rec_control wipe-cache not clear packet cache

2011-07-18 Thread Jeremy Utley

Good morning to all on the list!

We ran into an oddity with our recursors this morning that I wanted to 
consult with you about.  There was a domain we knew had its DNS entries 
changed, but our recursors were still getting the old data, and we 
wanted to fix that.  So, as I always do in this situation, I ran the 
command:


rec_control wipe-cache www.domain.com

However, for some odd reason, this 
time it did not seem to purge the entry from the cache, because a 
subsequent dig query to the recursor still showed the old entry, with 
the TTL value counting down.  Only by doing a full restart of the 
recursor did we actually get the cache to clear.


This was on PowerDNS Recursor version 3.3, using the pdns-recursor-3.3-1 
rpms available from the PDNS site.


Question:  Is it possible that wipe-cache perhaps does not clear the 
packet cache or any other form of entry caching in the recursors?
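
To tell a cached answer from a freshly fetched one after a wipe, it can help to compare the TTL in dig's answer with the TTL the authoritative server publishes. The Python sketch below parses one answer line; the sample line, the address and the function names are hypothetical, and the published TTL is assumed to be known from a direct query to the authoritative server.

```python
def answer_ttl(answer_line):
    """Extract the TTL (second field) from a dig ANSWER SECTION line."""
    fields = answer_line.split()
    if len(fields) < 5:
        raise ValueError("not a resource record line: %r" % answer_line)
    return int(fields[1])


def looks_cached(answer_line, authoritative_ttl):
    """A TTL below the published value suggests the answer came from a cache."""
    return answer_ttl(answer_line) < authoritative_ttl


# Hypothetical answer line, as dig might print it right after the wipe.
after_wipe = "www.domain.com.\t3412\tIN\tA\t192.0.2.10"
print(looks_cached(after_wipe, authoritative_ttl=3600))
```

A TTL that keeps counting down across the wipe is only a hint: an answer fetched a second or two earlier will also show a slightly reduced TTL.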


Jeremy


[Pdns-users] Odd Recursor/Authoritative problem with a private domain

2011-07-21 Thread Jeremy Utley

Hello to all on the list!

I'm seeing something kind of weird in our DNS setup, and was hoping I 
could bounce it off all of you to see if I could get some input.  First 
off, structure of our system:


6 Recursor servers, sitting behind a Foundry Load balancer, running 
pdns-recursor version 3.3-1 from the RPMs provided by PowerDNS
6 Authoritative servers, also sitting behind a Foundry Load balancer, 
running pdns-static-2.9.22-1 from the RPMs provided by PowerDNS


Other than the below problem, the setup works wonderfully.  On to the 
problem.


We set up a "private" zone named gnint.prv within our authoritative DNS 
servers to provide for private hostnames on our backend network (using 
10.1.20.0/255.255.252.0).  Within our recursors, we put the following 
into our config:


forward-zones-file=/etc/powerdns/stub-zone.conf

and within the stub-zone.conf file, we have the following:

gnint.prv=66.152.94.11, 66.152.94.12, 66.152.94.13
10.in-addr.arpa=66.152.94.11, 66.152.94.12, 66.152.94.13

The IPs referenced in the stub-zone.conf file are our load balancer 
IPs, which split traffic across all 6 authoritative servers.
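
For checking a file like this before reloading the recursor, a small Python sketch can parse the zone=address list lines shown above and reject malformed addresses. This is a simplified reading of the format based only on the two lines in this message; the function name is hypothetical, and treating lines that start with # as comments is an assumption.

```python
import ipaddress


def parse_forward_zones(text):
    """Parse 'zone=ip, ip, ...' lines into {zone: [ip, ...]}."""
    zones = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):  # comment handling is an assumption
            continue
        zone, sep, servers = line.partition("=")
        if not sep:
            raise ValueError("line %d: missing '='" % lineno)
        addrs = [s.strip() for s in servers.split(",") if s.strip()]
        for addr in addrs:
            ipaddress.ip_address(addr)  # raises ValueError on a malformed address
        zones[zone.strip()] = addrs
    return zones


sample = """\
gnint.prv=66.152.94.11, 66.152.94.12, 66.152.94.13
10.in-addr.arpa=66.152.94.11, 66.152.94.12, 66.152.94.13
"""
for zone, servers in parse_forward_zones(sample).items():
    print(zone, servers)
```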


When I try to do a lookup of an address I have defined within the 
gnint.prv domain using the Linux "host" command, I get the following:


$ host gn-ldap01.gnint.prv
gn-ldap01.gnint.prv has address 10.1.20.1
Host gn-ldap01.gnint.prv not found: 3(NXDOMAIN)
Host gn-ldap01.gnint.prv not found: 3(NXDOMAIN)

Notice that I get 2 NXDOMAIN responses along with the valid response.  
This is what bugs me, because I think this causes *some* machines to 
fail to resolve the hostname.  If I try some other domain against the 
recursors, I only see one answer:


$ host www.gammanetworking.com
www.gammanetworking.com has address 66.152.94.25

Of course, this lookup does not go through the stub-zone.conf facility; 
it is resolved through normal recursion instead.
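
One thing to keep in mind when reading the three lines of host output for gn-ldap01.gnint.prv: by default the host utility looks up A, AAAA and MX records for a name, so the lines may be three separate answers rather than one answer repeated. A name that exists but has no record of a given type would normally produce NOERROR with an empty answer, not NXDOMAIN. The Python sketch below classifies responses on that basis; the response values are hypothetical, modelled on the output above rather than captured from the servers.

```python
def classify(rcode, answer_count):
    """Classify a DNS response by response code and number of answer records."""
    if rcode == "NXDOMAIN":
        return "name does not exist"
    if rcode == "NOERROR" and answer_count == 0:
        return "name exists, no data of this type"
    if rcode == "NOERROR":
        return "answer"
    return "error: " + rcode


# Hypothetical responses modelled on the three lines of host output.
observed = {"A": ("NOERROR", 1), "AAAA": ("NXDOMAIN", 0), "MX": ("NXDOMAIN", 0)}
for qtype, (rcode, count) in observed.items():
    print(qtype, "->", classify(rcode, count))
```

If per-type dig queries (A, AAAA, MX) show this pattern, a resolver that caches the NXDOMAIN could then fail the A lookup as well, which would fit the erratic behaviour.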


Also, interesting to note that reverse DNS lookups do not show a similar 
problem:


$ host 10.1.20.1
1.20.1.10.in-addr.arpa domain name pointer gn-ldap01.gnint.prv.


Does anyone have any ideas on what I'm missing?

Jeremy




Re: [Pdns-users] Odd Recursor/Authoritative problem with a private domain

2011-07-21 Thread Jeremy Utley

On 7/21/2011 1:14 PM, Stefan Schmidt wrote:
> On Thu, Jul 21, 2011 at 8:00 PM, Jeremy Utley wrote:
>> gnint.prv=66.152.94.11, 66.152.94.12, 66.152.94.13
>> 10.in-addr.arpa=66.152.94.11, 66.152.94.12, 66.152.94.13
>>
>> ...
>>
>> $ host gn-ldap01.gnint.prv
>> gn-ldap01.gnint.prv has address 10.1.20.1
>> Host gn-ldap01.gnint.prv not found: 3(NXDOMAIN)
>> Host gn-ldap01.gnint.prv not found: 3(NXDOMAIN)
>
> Are those machines maybe using some kind of asynchronous dns library?

Not to my knowledge.  The machines are bog standard CentOS 5.6 machines, 
using the stock "host" command that comes with CentOS 
(bind-utils-9.3.6-16.P1.el5 package).

> If you do a
> dig @  gn-ldap01.gnint.prv
> for each of your loadbalancer IPs does it show NXDOMAIN somewhere?

No, it does not.  Running dig against both the recursors and the 
authoritative servers, whether through the load balancer or directly to 
the machines, gives no NXDOMAIN responses.

> Also worth trying: Does ping gn-ldap01.gnint.prv work every time?

It's erratic.  I have seen pings fail because the hostname could not be 
resolved; other times it works just fine.


Jeremy


[Pdns-users] Odd Recursor problems

2012-01-20 Thread Jeremy Utley

Hello all,

We're having some odd intermittent problems with our recursors, and I'm
not sure whether I should be concerned about them.  It seems that
intermittently, when we query our recursors for a CNAME record, we're
not getting a proper response.  I am going to be detailed about the
problem, so this will be a long message, and I apologize in advance for
that.  However, I've about reached my wits' end trying to diagnose this
issue.


The problem began when we started getting reports from our clients that
their CSS files were intermittently not loading.  CSS files are stored
with static images on the Edgecast and Level 3 CDN systems, and
troubleshooting the chain led us to run a bunch of DNS tests, which is
where things started getting suspicious.

We're running 6 recursors, all behind a Foundry load balancer, with
virtual IPs funneling traffic from on-site machines to the recursors.
All recursors are running the x86_64 RPM of pdns-recursor 3.3 downloaded
directly from the web site, and the OS is CentOS 5.x.  Until now, we
haven't seen any issues with this setup, and it's been in production for
over 3 years.


Edgecast/Level3 have us set up the CDN by creating a CNAME record which 
points at their systems, i.e.


 cdn.domain.com 43200 IN CNAME wpc.1737.edgecastcdn.net.

As part of our troubleshooting, we set up a number of checks within our
Nagios monitoring software to monitor the resolution of these entries.
Using the Nagios "check_dig" plugin, we are able to do resolution checks
against all 6 of our DNS servers once per minute.  Essentially, we have
the plugin running these commands every minute:

 dig @{nameserver-ip} any cdn.domain.com
 dig @{nameserver-ip} a cdn.domain.com
 dig @{nameserver-ip} cname cdn.domain.com

With these tests in place and firing off every minute, we see
intermittent failures (No ANSWER SECTION found) when querying our
recursors for A or ANY, never for CNAME.  When a check fails, the next
check one minute later passes.  We have a couple of machines that run
their own BIND caching nameserver, and the same tests against them show
no issues.  Also as a test, we set up a dummy record with a CNAME to a
host on a totally separate, lightly used authoritative server, and those
tests have never shown failures either.
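
For logging these failures outside Nagios, the same pass/fail decision can be made by looking for a non-empty ANSWER SECTION in dig's text output. The Python sketch below does that on captured output; the function name and the abbreviated sample outputs are hypothetical, and it does not reproduce how check_dig itself decides.

```python
def has_answer_section(dig_output):
    """Return True if dig's text output contains a non-empty ANSWER SECTION."""
    lines = dig_output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(";; ANSWER SECTION:"):
            # The section is non-empty if a record line follows the header.
            nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
            return bool(nxt) and not nxt.startswith(";")
    return False


# Hypothetical, abbreviated dig outputs for a passing and a failing check.
ok = ";; ANSWER SECTION:\ncdn.domain.com.\t43200\tIN\tCNAME\twpc.1737.edgecastcdn.net.\n"
bad = ";; QUESTION SECTION:\n;cdn.domain.com.\t\tIN\tA\n"
print(has_answer_section(ok), has_answer_section(bad))
```

Saving the full dig output whenever this returns False, together with a timestamp, would make it easier to line failures up with the recursor trace log.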


The failures appear to be totally random: you might see 2 or 3 failures
within 15 minutes, and then you might not see another failure for over
an hour.

The syslogs for the recursors also show nothing out of the ordinary.

Right now, I am working under the theory that occasionally the recursor
does not get a timely response from the Edgecast/Level3 authoritative
servers, and is therefore failing.  However, it does seem odd that I
wouldn't see the problem with our standalone BIND servers.  One other
thing I have done for testing is to disable load-balanced traffic to one
of our 6 nameservers and turn on the recursor trace mode on that
nameserver.  However, even with only a few checks every minute addressed
to it, piecing together the trace logs is still not easy.


Does anyone else have any thoughts on this?

Thanks for any assistance you can give me!

Jeremy Utley
