On 4/28/2012 1:04 PM, Paul Rubin wrote:
Roy Smith <r...@panix.com> writes:
I agree that application-level name caching is "wrong", but sometimes
doing it the wrong way just makes sense.  I could whip up a simple
caching wrapper around getaddrinfo() in 5 minutes.  Depending on the
environment (both technology and bureaucracy), getting a caching
nameserver installed might take anywhere from 5 minutes to a few days to ...
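A minimal version of such a wrapper might look like the sketch below.  This is illustrative only (the function name is made up, and it ignores DNS TTLs entirely, so entries never expire -- fine for a short-lived process, wrong for a long-running one):

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_getaddrinfo(host, port):
    """Memoize DNS lookups for the lifetime of the process.

    Caveat: no TTL handling -- cached entries never expire.
    """
    # Return a tuple so the cached value is immutable.
    return tuple(socket.getaddrinfo(host, port))
```

Repeated lookups of the same (host, port) pair then hit the in-process cache instead of going back to the resolver.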

IMHO this really isn't one of those times.  The in-app wrapper would
be usable only by that one process, and we already know that the OP has
multiple processes running the same app on the same machine.  They would
benefit from being able to share the cache, so now your wrapper gets
more complicated.  If it's not a nameserver, then it's something that
fills in for one.  And then, since the application appears to be a
large-scale web spider, it probably wants to run on a cluster, and the
cache should be shared across all the machines.  So you really probably
want an industrial-strength nameserver with a big persistent cache, and
maybe a smaller local cache because of the high locality when crawling
specific sites, etc.

    Each process is analyzing one web site, and has its own cache.
Once the site is analyzed, which usually takes about a minute,
the cache disappears.  Multiple threads are reading multiple pages
from the web site during that time.
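A per-site cache shared by several reader threads could be sketched as follows.  This is a hypothetical structure, not the actual analyzer's code; the lock keeps the shared dict consistent while the DNS lookup itself happens outside the lock:

```python
import socket
import threading

class SiteDNSCache:
    """DNS cache for one site's analysis, shared by multiple
    reader threads and discarded when the analysis finishes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def getaddrinfo(self, host, port):
        key = (host, port)
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        # Do the (slow) lookup without holding the lock, so other
        # threads aren't blocked on network I/O.
        result = socket.getaddrinfo(host, port)
        with self._lock:
            # If another thread raced us, keep the first result.
            return self._cache.setdefault(key, result)
```

Since the cache lives only as long as the site's analysis (about a minute here), skipping TTL expiry is a reasonable simplification.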

    A local cache is enough to fix the huge overhead of doing a
DNS lookup for every link found.  One site with a vast number of
links took over 10 hours to analyze before this fix; now it takes
about four minutes.  That solved the problem.  We could likely get
a further minor performance boost from a real local DNS daemon, and
will probably configure one.

    We recently changed servers from Red Hat to CentOS, and management
tools from cPanel to Webmin.  Before the change, we had a local caching
DNS daemon, so we didn't have this problem.  Webmin's defaults tend to
be on the minimal side.

    The DNS information is used mostly to help decide whether two URLs
actually point to the same IP address, as part of deciding whether a
link is on-site or off-site.  Most of those links will never be read.
We're not crawling the entire site, just looking at likely pages to
find the name and address of the business behind the site.  (It's
part of our "Know who you're dealing with" system, SiteTruth.)
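The same-IP check described here could be approximated like this.  It is a sketch only -- the function name and the "any shared IP" heuristic are illustrative, not SiteTruth's actual logic, and it ignores round-robin DNS subtleties:

```python
import socket
from urllib.parse import urlparse

def same_host_ip(url_a, url_b):
    """Heuristic: treat two URLs as on the same site if their
    hostnames resolve to at least one common IP address."""
    def ips(url):
        host = urlparse(url).hostname
        # Collect every address the resolver returns for this host.
        return {info[4][0] for info in socket.getaddrinfo(host, None)}
    return bool(ips(url_a) & ips(url_b))
```

With a caching getaddrinfo underneath, this comparison costs at most one real lookup per distinct hostname, which is what makes per-link checks affordable.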
                        
                                John Nagle

--
http://mail.python.org/mailman/listinfo/python-list
