I did some hacking on this, and I have additional information. However, not being too DNS-savvy, I'm not sure exactly what to make of it.
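
For anyone who wants to reproduce the tallies below, here is a minimal sketch of how repeated lookups can be counted from the verbose output. It assumes each real lookup is logged on one line shaped like "DNS query: A <name> ... <result>", as in the excerpts below. The function name repeated_queries and the regular expression are my own illustration, not part of SpamBayes.

```python
import re
from collections import Counter

# Assumed log shape (taken from the excerpts in this message):
#   "DNS query: A some.host.name ... done"
_QUERY_RE = re.compile(r"DNS query:\s+(\S+)\s+(\S+)\s+\.\.\.\s+(.*\S)")


def repeated_queries(lines):
    """Return {(qtype, name): count} for queries logged more than once.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in lines:
        match = _QUERY_RE.search(line)
        if match:
            qtype, name, _result = match.groups()
            counts[(qtype, name)] += 1
    # Keep only the lookups that reached the DNS server more than once.
    return {key: n for key, n in counts.items() if n > 1}
```

Feeding it the lines of the captured verbose output gives the names worth checking against the cache file.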

1. Lowering the timeout certainly speeds things up. It seems as though a lower timeout for training might be appropriate.
2. Apparently the cache is never pruned, because even though I have the global:verbose flag set I never get to see cache statistics, which are printed in the prune() method among other places.
3. Many queries to the actual DNS server (as opposed to the cache) are repeated. A list of all repeated queries is attached below. Some of these queries are repeated multiple times.
4. All queries appear to be of the “A” variety; I'm not getting any “PTR” queries.
5. Clearly not all query results are getting stored in the cache file, because repeated runs yield more actual queries to DNS.
6. Something seems to be failing occasionally in the parsing of these host names, because a bunch of them have a trailing close-parenthesis:

DNS query: A allmydata.com) ... done
DNS query: A allmydata.org) ... done
DNS query: A cacm.acm.org) ... done
DNS query: A jobs.acm.org) ... done
DNS query: A sc09.sc-education.org) ... done
DNS query: A technews.acm.org), ... done
DNS query: A tocs.acm.org) ... done
DNS query: A women.acm.org) ... done
DNS query: A www.dac.com) ... done
DNS query: A www.reviews.com) ... done

Repeated queries:
================

A 05ff.mqoqamuj.cn ... Timeout
A 110.mqoqamuj.cn ... Timeout
A 191c2.mqoqamuj.cn ... Timeout
A 2b1c5.mqoqamuj.cn ... Timeout
A 349.mqoqamuj.cn ... Timeout
A 38a24e.mqoqamuj.cn ... Timeout
A 4808.mqoqamuj.cn ... Timeout
A 586c7.mqoqamuj.cn ... Timeout
A 5f1.mqoqamuj.cn ... Timeout
A 6e46a.mqoqamuj.cn ... Timeout
A 79d272.mqoqamuj.cn ... Timeout
A 82690.mqoqamuj.cn ... Timeout
A 9e945a.mqoqamuj.cn ... Timeout
A a32.mba010.cn ... done
A aaderminc.com ... Timeout
A accordingly.princetonglobal2009.net ... Timeout
A acicarpha.com ... Timeout
A asparagusgoodnotions.com ... Timeout
A azurebug.com ... Timeout
A b500.mqoqamuj.cn ... Timeout
A beigegratefulnotions.net ... done
A belowclimbgreat.com ... Timeout
A bravo.validmotions.com ... done
A chilean.strathmorebusinessleaders.net ... Timeout
A chilledpair.com ... Timeout
A click.directworldbrands.com ... Timeout
A concurrent-scenes.net ... Timeout
A coprophagous.worldwidenomination.com ... Timeout
A crosscheck.oapublishing.com ... done
A cs212.campingstatus.info ... Timeout
A cwuozc.broadloud.cn ... done
A cyberpunks.ro ... Timeout
A d1055284.domain.com ... done
A d4a.mqoqamuj.cn ... Timeout
A d8dec.mqoqamuj.cn ... Timeout
A dbf7b.mqoqamuj.cn ... Timeout
A deeppeachgoodnotions.com ... Timeout
A deeppinkgoodnotions.com ... Timeout
A deltas.ilikestuff4inbox.com ... Timeout
A disgustingob.com ... done
A e.swearch.com ... Timeout
A e46711.mqoqamuj.cn ... Timeout
A easymd.info ... Timeout
A expediter.infoininbox.com ... Timeout
A f.thegeneralmeweb.com ... Timeout
A f.thegreatfinancialworldbureau.com ... done
A f1113.mqoqamuj.cn ... Timeout
A fjfeadjadcmfehn.swingdraft.com ... Timeout
A fortunateshopping.com ... Timeout
A gg-email.info ... done
A globalmailingservices.com ... Timeout
A gnmv218.wemovedmichigan.info ... Timeout
A growthblock.com ... Timeout
A ilikestuff4inbox.com ... Timeout
A infoininbox.com ... Timeout
A ipodesktop.net ... done
A jepizcz.cn ... Timeout
A jt16.wakesabuse.com ... Timeout
A june.whoswhomta5.com ... Timeout
A linkcontract.com ... Timeout
A mail.oversubtly.com ... done
A movementstac.com ... Timeout
A mx4.datauber.com ... done
A mx6.mylistlinetionlive.com ... done
A myclubmember.org ... Timeout
A ncgeacafdcmjhdi.inmentorindeed.com ... Timeout
A numbers.truelife-stores.com ... Timeout
A p.thegeneralmeweb.com ... Timeout
A p.thegreatfinancialworldbureau.com ... done
A paypal-user-confirm.com ... done
A playfulpreviews.com ... Timeout
A promotions.newegg.com. ... empty label
A reefs.kennediabox.com ... Timeout
A remot.servicepropers.com ... done
A response.calculating-productions.net ... Timeout
A searchableappeals.net ... Timeout
A secure.skype.com. ... empty label
A smileblog.org ... done
A stageduel.com ... Timeout
A status.twitter. ... empty label
A truesilvertech.com ... Timeout
A unslingingun.com ... done
A unsub.atmsurveys.com ... done
A updates.liquiddimensions.net ... Timeout
A vesuviuspoeti.com ... Timeout
A vuhceiadgmcffiz.growthblock.com ... Timeout
A wakesabuse.com ... Timeout
A wallstreetreporting-info.com ... Timeout
A williammolina.com ... done
A wishunfold.com ... Timeout
A ww5.domainsauctionsale.com ... Timeout
A www. ... empty label
A www.almostcrazy.net ... Timeout
A www.atmsurveys.com ... done
A www.azurebug.com ... Timeout
A www.bitineer.com ... Timeout
A www.cbnwelcome.com ... done
A www.codequiet.com ... Timeout
A www.contactthem.ws ... Timeout
A www.cyberpunks.ro ... Timeout
A www.david.abrahams.cbnwelcome.com ... done
A www.easy-md.com ... Timeout
A www.gameeuroprime.net ... Timeout
A www.globalinternetmail.com ... done
A www.icecreamsundaesurveys.com ... Timeout
A www.imgettinghealthyforlife.net ... Timeout
A www.lifeisalongroad.net ... Timeout
A www.lightsavetoday.info ... Timeout
A www.mostdo.info ... Timeout
A www.nasd.com. ... empty label
A www.novelfr.com ... done
A www.playthatsongagain.net ... done
A www.sgfdsfsd.cn ... done
A www.startupsteps.com ... done
A www.swearch.com ... Timeout
A www.thegeneralmeweb.com ... Timeout
A www.thegreatfinancialworldbureau.com ... done
A www.vrtsw.com ... done
A www.whoswhoregistry.net ... done
A www.xpt1.com.br ... done
A www.youcanstart.info ... Timeout
A xh263647.xarchives.org ... Timeout
A yourdatadirect.com ... Timeout
A yrhceiadgmcffqj.stageduel.com ... Timeout

At Fri, 12 Feb 2010 15:23:21 -0500,
David Abrahams wrote:
>
> Hi,
>
> I'm using the train-to-exhaustion script, and it seems to be taking a
> seriously long time to process some of the messages.
>
> It turns out that using -o Tokenizer:x-lookup_ip:True causes a serious
> hit to training speed. I do have Tokenizer:lookup_ip_cache set.
> Not only that, but it seems to go slowly even on the second and
> subsequent training passes, by which time I'd think the cache would be
> full. So I'm wondering if the cache is really working, or if it's
> size-limited so that it's blown by my large training set, or if
> there's some other issue.
>
> I notice that the cache--as integrated into SpamBayes--doesn't support
> selecting a timeout other than “10” nor does it support choosing the
> DNS server, even though the cache class itself allows that to be
> tuned. I don't have any good reason to think either of these are the
> problem.
>
> Any insight you can offer would be very much appreciated.
>
> Thanks,
>
> --
> Dave Abrahams           Meet me at BoostCon: http://www.boostcon.com
> BoostPro Computing
> http://www.boostpro.com
>
> _______________________________________________
> [email protected]
> http://mail.python.org/mailman/listinfo/spambayes
> Info/Unsubscribe: http://mail.python.org/mailman/listinfo/spambayes
> Check the FAQ before asking: http://spambayes.sf.net/faq.html

--
Dave Abrahams           Meet me at BoostCon: http://www.boostcon.com
BoostPro Computing
http://www.boostpro.com
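
P.S. Regarding observation 6 and the "empty label" results in the repeated-queries list: a minimal sketch of the kind of cleanup that could be applied to a candidate host name before it is handed to the resolver. The function name clean_hostname and the sets of characters it strips are assumptions of mine for illustration; I have not checked how the SpamBayes tokenizer actually extracts these names.

```python
import re

# Characters that can cling to a host name pulled out of running text.
# These sets are assumptions chosen for illustration.
_LEADING_JUNK = "(<[\"'"
_TRAILING_JUNK = ")>]\"',;:."

# One DNS label: letters, digits and hyphens, 1-63 characters,
# with no leading or trailing hyphen.
_LABEL_RE = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def clean_hostname(raw):
    """Return a lookup-ready host name, or None if nothing usable is left.

    Hypothetical helper, not part of SpamBayes.
    """
    name = raw.strip().lstrip(_LEADING_JUNK).rstrip(_TRAILING_JUNK).lower()
    labels = name.split(".")
    # Require at least two labels, and reject empty or malformed ones.
    if len(labels) < 2 or not all(_LABEL_RE.match(label) for label in labels):
        return None
    return name


for raw in ["allmydata.com)", "technews.acm.org),", "promotions.newegg.com.", "www."]:
    print("%r -> %r" % (raw, clean_hostname(raw)))
```

Dropping the trailing dot gives up the distinction between a fully qualified name and a relative one, which seems acceptable here since the lookups that kept it all failed with "empty label". Names that clean up to a single label, such as "www.", are rejected rather than queried.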
