Re: Rewritten URIBL plugin

2010-07-28 Thread Jared Johnson


 Jared Johnson wrote:

  I think we should probably consider putting support for parsed messages
  into core, with the parsing done lazily if requested by the API.

 I forgot, we did kinda think of a couple of reasons not to want an API.
 Depending on where you put it, you may find QP in general depending on
 MIME::Parser even if it never uses it.  The benefit of the plugin is that
 /etc/qpsmtpd/plugins controls whether you wind up having to install, and
 take up your memory with, MIME::Parser.

 One thing we've discussed in the past (at least in my imagination)
 although not quite figured out how to implement, is making plugins act
 a little more like normal modules, so that one plugin can use
 another.

 So if you're interested in the parsed mime functionality, your
 plugin can plugin_use util/parsed-mime and the right magic happens.

Oh yeah, that's right, someone *did* implement what you're talking about.
You can do it with 'plugin inheritance' (which, ironically, I knew nothing
about until looking at the async uribl plugin the other night):

# /usr/share/qpsmtpd/plugins/mime_parser
use MIME::Parser;

sub parse_mime {
    my ( $self, $txn ) = @_;
    return $txn->notes('mime_body') if $txn->notes('mime_body');
    # ... a bunch of code to create a $msg ...
    $txn->notes( mime_body => $msg );
    return $txn->notes('mime_body');
}

# /usr/share/qpsmtpd/plugins/some_plugin_that_might_want_mime_parsing
sub init {
    my ( $self, $qp, %args ) = @_;
    return unless $qp->config('parse_mime') or $args{parse_mime};
    $self->isa_plugin('mime_parser');
    $self->{can_parse_mime} = 1;
}

sub hook_data_post {
    my ( $self, $txn ) = @_;
    if ( $self->{can_parse_mime} ) {
        $self->some_recursive_mime_function( $self->parse_mime($txn) );
    } else {
        $self->regular_body_getline_function( $txn );
    }
}


Voila!  A lazy 'plugin model' that is *also* an API, which doesn't 'use
MIME::Parser' until you *want* to use it.  Furthermore, you don't have to
use just the official mime_parser plugin: with a little modification (or
alternatively some care to use the same filename within an alternate
directory) you could use your home-rolled messier-but-more-efficient
version that doesn't use MIME::Parser at all.

I could easily modify my own stuff to do this and test it.  I'd even be
interested in pulling a couple of our own home-grown MIME recursion
methods into it.  If this is indeed the will of the council ;)

P.S. I'm kind of stoked about plugin inheritance these days.  I'm becoming
convinced that after some code churn on our side it will allow us to
finally switch to async without having to just rewrite all our plugins,
switch to the async daemon, and see what happens.  And with a little churn
on the upstream side, if that manages to happen, I also think it could
allow us to un-fork 90% of the QP plugin code we currently have re-written
and instead submit patches to QP that have some guarantee of being tested
and aren't surrounded by completely useless context :)

-Jared



Re: [Fwd: DoubleCheck DataFeed access]

2010-07-28 Thread Jared Johnson
 Also:  we recently contributed a new URIBL plugin to the Qpsmtpd
 project,
 which makes use of our pruned TLD lists that use your datafeed data.
 They had some question of whether this was something you would be
 comfortable with having distributed publicly.  Note no actual URIBL data
 is distributed, just a list of TLDs that happens to *not* include
 top-level TLDs that would be extremely unlikely to generate hits against
 your service.  This is used to limit the number of extraneous queries to
 your public mirrors, which I'm guessing you would consider beneficial.
 Could you verify whether we have your permission/blessing to distribute
 such a list gleaned from your data?


 Ya, fine.  It doesn't sound like it would have significant impact on
 volume to me, as the top 25 TLDs (including IPv4 volume) that are
 queried represent 91% of the total query volume, and the top 100 TLDs
 represent 99% of the volume.

 If you tell me which tlds are suppressed, I can give you an idea of
 query volume savings according to mirror traffic.

 For example, suppression of .mil and .int would save 4/100ths of a
 percent (0.00039).  Now, if there was a hacked webserver in .mil and
 spammers used it as a drop page or redirector, our temporary listing of
 it would never hit for you if you suppress them.  I guess you have to
 weigh the savings versus the potential for abuse.  You wouldn't want to
 suppress a TLD that becomes the next spammer haven and have to scramble
 to release an update.  Think of recent history such as .tk (Tokelau),
 .st (Sao Tome), .im (Isle of Man), and others.

There are 135 excluded TLDs:

ac ad af ag ai al an ao aq arpa asia aw ax bb bf bi bj bm bn bo bt bw cd
cf cg ci ck coop cr cu cv dj dm do dz edu er et fj fk fm fo ga gf gh gi gl
gm gn gov gp gq gt gu gw gy ht int iq jm jo jobs kh ki km kn kp kw ky lb
lc lk lr ls lu ly mc mg mh mil ml mm mo mp mq mr mt museum mv mw mz na nc
nf ng ni nr om pa pf pg pn ps pw qa re rw sb sd sh sl sm sn sr sv sy sz td
tel tf tg tj tm tn travel ug va vg vi vu wf ye yu zm zw

... I also used the feed data to prune two- and three-level TLDs from the
SURBL list, which is pretty obviously not based on any data:

coop.br coop.tt gov.ae gov.am gov.ar gov.as gov.au gov.az gov.ba gov.bd
gov.bh gov.bs gov.by gov.bz gov.ch gov.cl gov.cm gov.co gov.cx gov.cy
gov.ec gov.ee gov.eg gov.ge gov.gg gov.gr gov.hk gov.hu gov.ie gov.il
gov.im gov.in gov.io gov.ir gov.is gov.it gov.je gov.jp gov.kg gov.kz
gov.la gov.li gov.lt gov.lv gov.ma gov.me gov.mk gov.mn gov.mu gov.my
gov.np gov.ph gov.pk gov.pl gov.pr gov.pt gov.py gov.rs gov.ru gov.sa
gov.sc gov.sg gov.sk gov.st gov.tl gov.to gov.tp gov.tt gov.tv gov.tw
gov.ua gov.uk gov.vc gov.ve gov.vn gov.ws gov.za jobs.tt tel.no tel.tr
act.gov.au nsw.gov.au nt.gov.au pa.gov.pl po.gov.pl qld.gov.au sa.gov.au
so.gov.pl sr.gov.pl starostwo.gov.pl tas.gov.au ug.gov.pl um.gov.pl
upow.gov.pl uw.gov.pl vic.gov.au wa.gov.au

There were also about _1400_ reserved .us TLDs pruned from the list, IIRC.

You make some pretty good points.  It may well not be worth the trouble,
at least for one-level TLDs.
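For what it's worth, the client-side check is tiny.  Here's a minimal
sketch (the helper name and the abbreviated sample list are mine, not the
actual plugin code) of skipping lookups for excluded TLDs:

```perl
use strict;
use warnings;

# Hypothetical sketch: a handful of entries from the excluded-TLD list
# above; the real plugin config would carry all 135.
my %excluded_tld = map { $_ => 1 } qw( mil int gov edu arpa museum ac ad );

# Decide whether a hostname is worth querying against the URIBL mirrors.
sub should_query {
    my ($host) = @_;
    my ($tld) = $host =~ /\.([^.]+)\z/;
    return 0 unless defined $tld;           # bare label, nothing to query
    return $excluded_tld{ lc $tld } ? 0 : 1;
}
```

So should_query('redirector.example.mil') returns false and saves the DNS
round trip, while ordinary .com/.net hosts go through as usual.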

Thanks,

Jared Johnson
Software Developer
DoubleCheck Email Manager



Re: per-recipient configuration

2010-07-28 Thread David Nicol
On Wed, Jul 28, 2010 at 2:35 PM, David Nicol davidni...@gmail.com wrote:
 On Wed, Jul 28, 2010 at 2:14 PM, Jared Johnson jjohn...@efolder.net wrote:
 I think we've had a thread about this before, but how do you see the API
 for a standard hook for persistence working?

 either the tie or overload interface is invoked by the plug-in, [...]

so inside QP, the qpsmtpd::address object would have a known parameter
that brings up the per-address persistence hash, which would be a flat
hash. Something like

   $object->{persistent} = ( eval {
       $PERSISTENCEFRAMEWORK->new( Address => $object ) } or {} );

at the end of the address object constructor, possibly even more generic.

The persistent element would default to a non-persistent version
when $PERSISTENCEFRAMEWORK isn't set to something that works, and when
it has been configured, per-address config can be loaded or altered
via

  return HARDFAIL if
    $AddressObject->{persistent}->{AcceptableSpamScore} <
    $Message->getSpamScore;

or such


Re: per-recipient configuration

2010-07-28 Thread Jared Johnson
 so inside QP, the qpsmtpd::address object would have a known parameter
 that brings up the per-address persistence hash, which would be a flat
 hash. Something like
..

See, in my mind, per-recipient config and persistent data storage are more
separate.  Maybe part of the reason I look at it this way is that in my
own implementation, I never really write config for a recipient, I only
read it (from my persistent storage, the db, of course).  I don't see it
necessary to be able to say $rcpt->config( spam_threshold => 10 ) from QP;
I'd do it from the UI.  Stored things are always written from QP
(logging) and sometimes read (auto-whitelist).  A hook_user_config plugin
would even be likely to make use of the persistent storage itself, but I
still see the concepts as split, and you could implement and benefit from
either one without the other.  When I went to write a reference plugin for
hook_user_config, I just thought of one where an admin can drop config
for users into directories on the fs in the event that he wants to
override what's already set at the global level; I think the hook
structure even fell through to $qp->config(), so it could really just
begin with an extension of /etc/qpsmtpd and go from there, just like
$qp->config() does.
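A filesystem-backed lookup along those lines could be as small as this
(a sketch under my own assumptions: the /etc/qpsmtpd/users layout, the
option-per-file convention, and the fallback callback standing in for
$qp->config() are all hypothetical, not the actual reference plugin):

```perl
use strict;
use warnings;
use File::Spec;

# Look for <base>/<recipient>/<option>; if the admin hasn't dropped a
# per-user file there, fall through to the global config callback.
sub user_config {
    my ( $base, $rcpt, $option, $global ) = @_;
    my $file = File::Spec->catfile( $base, lc $rcpt, $option );
    if ( open my $fh, '<', $file ) {
        chomp( my @lines = <$fh> );
        close $fh;
        return @lines;              # per-user override wins
    }
    return $global->($option);      # just like $qp->config() would
}
```

e.g. user_config( '/etc/qpsmtpd/users', 'bob@example.com',
'spam_threshold', sub { $qp->config( $_[0] ) } ).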

That said, for persistent storage I would like to see a more
straightforward API and skeleton API; something like $txn->get,
$txn->store, $rcpt->get, $rcpt->store, and a corresponding hook_get and
hook_store that are passed ( $self, $txn, [ $rcpt ] ) so the plugin knows
what to key on, though it can even ignore that if it wants.  Extra points
if QP provides a
safe-but-reusing-connections-appropriately-depending-on-the-forking-model
DBI handle via $self->qp->dbh, and maybe a $self->qp->cache() for a
'persistent' Cache::FastMmap or Memcached cache, but the plugin could be
required to establish the actual data store for itself.  Then the plugin
goes to town.  It can store everything in a generic way, key => value or
whatever, or it could map the key names to your own business logic.  The
reference persistent storage plugin could still implement the type of
store you're talking about, and that would work out of the box for people
who just want greylisting, etc.; but if I'm reading it correctly, even if
it's really awesome, it's likely a lot of developers would just have to
scrap it in favor of what they're already doing, at which point the more
flexibility hook_store plugins have the better.

# in stock upstream plugins/greylist
$rcpt->store( greylist => $self->qp->connection->remote_ip );

# in our internal storage plugin that overrides the generic one
# I don't think anyone would actually want this in particular :)
sub init { shift->{dbh} = DBI->connect(...) }
sub dbh  { shift->{dbh} }

sub hook_store {
    my ( $self, $txn, $rcpt, $conf, $value ) = @_;
    return DECLINED unless $conf eq 'greylist';
    return OK unless $rcpt->notes('user_uid');
    my $sth = $self->dbh->prepare_cached(
        'INSERT INTO bobsgreylist (ip,helo,rcpt) VALUES (?,?,?)' );
    $sth->execute( $value,
                   $self->qp->connection->hello_host,
                   $rcpt->notes('user_uid') );
    return OK;
}
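The fall-through dispatch those return codes rely on can be sketched
standalone (a sketch only: the constant values are illustrative, and
run_store_hooks is my stand-in for QP's real hook runner):

```perl
use strict;
use warnings;

use constant { OK => 900, DECLINED => 909 };   # illustrative values

# Run hook_store handlers in order; a DECLINED handler passes the write
# on to the next plugin, so a site-specific storage plugin can shadow
# the generic one just by answering before it.
sub run_store_hooks {
    my ( $handlers, @args ) = @_;
    for my $h (@$handlers) {
        my $rc = $h->(@args);
        return $rc if $rc != DECLINED;
    }
    return DECLINED;    # no plugin claimed this key
}
```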

Unlike per-recip config, though, we don't have any API etc. written
in-house to support generic persistent storage writing; for now we just
stick our DBI calls directly in our plugins, making them even more forked.
So this is all purely theoretical, and David has the advantage of speaking
from some experience :)

-Jared



Re: per-recipient configuration

2010-07-28 Thread David Nicol
 See, in my mind, per-recipient config and persistent data storage are more
 separate.  Maybe part of the reason I look at it this way is that in my
 own implementation, I never really write config for a recipient, I only
 read it (from my persistent storage, the db, of course).  I don't see it
 necessary to be able to say $rcpt->config( spam_threshold => 10 ) from QP;
 I'd do it from the UI.  Stored things are always written from QP

I have stuff that might wind up looking like

  $rcpt->persistent->{statistics_receivedmsgs_total}++;
  $rcpt->persistent->{statistics_receivedstatsbysender}{$NormalizedSenderAddress}++;

in addition to reading configuration.