Re: Rewritten URIBL plugin
So if you're interested in the parsed mime functionality, your plugin can plugin_use util/parsed-mime and the right magic happens. Oh yeah that's right, someone *did* implement what you're talking about. You can do it with 'plugin inheritance' (which ironically i knew nothing about until looking at the async uribl plugin the other night): You do know who that someone was, right? ;) snip Voila! A lazy 'plugin model' that is *also* and API, which doesn't 'use MIME::Parser' until you *want* to use it; furthermore, you don't have to just put the official mime_parser plugin in there, with a little modification (or alternatively some care to use the same filename within an alternate directory) you could use your home-rolled messier-but-more-efficient version that doesn't use MIME::Parser at all. Nice. That does work out to be a nice bit of code reuse. The syntactic sugar is.. well.. lacking, but that's ok. It's understandable! I could easily modify my own stuff to do this and test it. I'd even be interested in pulling a couple of our own home-grown MIME recursion methods into it. If this is indeed the will of the council ;) I like it. But I'm just one person. p.s. I'm kind of stoke about plugin inheritance these days. I'm becoming convinced that after some code churn on our side it will allow us to finally switch to async without having to do just rewrite all our plugins, switch to the async daemon, and see what happens. And with a little churn on the upstream side, if that manages to happen, I also think it could allow us to un-fork 90% of the QP plugin code we currently have re-written and instead submit patches to QP that have some guarantee to be tested and aren't surrounded by completely useless context :) Nice. -R
Re: Rewritten URIBL plugin
Jared Johnson wrote: sub parse_mime { That works, only this should be called parsed_mime because you're asking for the parsed bit, not telling it to parse (every time). Matt.
Re: Rewritten URIBL plugin
Jared Johnson wrote: I think we should probably consider putting support for parsed messages into core, with the parsing done lazily if requested by the API. I forgot, we did kinda think of a couple of reasons not to want an API. depending on where you put it, you may find QP in general depending on MIME::Parser even if it never uses it. The benefit of the plugin is that /etc/qpsmtpd/plugins controls whether you wind up having to install, and take up your memory with, MIME::Parser One thing we've discussed in the past (at least in my imagination) although not quite figured out how to implement, is making plugins act a little more like normal modules, so that one plugin can use another. So if you're interested in the parsed mime functionality, your plugin can plugin_use util/parsed-mime and the right magic happens. Oh yeah that's right, someone *did* implement what you're talking about. You can do it with 'plugin inheritance' (which ironically i knew nothing about until looking at the async uribl plugin the other night): # /usr/share/qpsmtpd/plugins/mime_parser use MIME::Parser; sub parse_mime { my ( $self, $txn ) = @_; return $txn-notes('mime_body') if $txn-notes('mime_body'); a bunch of code to create a $msg ... $txn-notes( mime_body = $msg ); return $txn-notes('mime_body'); } # /usr/share/qpstmpd/plugins/some_plugin_that_might_want_mime_parsing sub init { my ( $self, $qp, %args ) = @_; return unless $qp-config('parse_mime') or $args{parse_mime}; $self-isa('mime_parser'); $self-{can_parse_mime} = 1; } sub hook_data_post { my ( $self, $txn ) = @_; if ( $self-{can_parse_mime} ) { $self-some_recursive_mime_function( $self-parse_mime ); } else { $self-regular_body_getline_function( $txn ); } } Voila! A lazy 'plugin model' that is *also* and API, which doesn't 'use MIME::Parser' until you *want* to use it; furthermore, you don't have to just put the official mime_parser plugin in there, with a little modification (or alternatively some care to use the same filename within an alternate directory) you could use your home-rolled messier-but-more-efficient version that doesn't use MIME::Parser at all. I could easily modify my own stuff to do this and test it. I'd even be interested in pulling a couple of our own home-grown MIME recursion methods into it. If this is indeed the will of the council ;) p.s. I'm kind of stoke about plugin inheritance these days. I'm becoming convinced that after some code churn on our side it will allow us to finally switch to async without having to do just rewrite all our plugins, switch to the async daemon, and see what happens. And with a little churn on the upstream side, if that manages to happen, I also think it could allow us to un-fork 90% of the QP plugin code we currently have re-written and instead submit patches to QP that have some guarantee to be tested and aren't surrounded by completely useless context :) -Jared
Re: Rewritten URIBL plugin
On Sun, 25 Jul 2010, Jared Johnson wrote: The plugin has the following advantages over the original: - Uses MIME::Parser to unpack message text so that we can look for URI's in base64-encoded data, etc., and _not_ look for URI's in noise. I think we should probably consider putting support for parsed messages into core, with the parsing done lazily if requested by the API. Thoughts? Matt.
Re: Rewritten URIBL plugin
I think we should probably consider putting support for parsed messages into core, with the parsing done lazily if requested by the API. It's a good idea, it's only that I haven't yet had the time and justification to write and to test a move from the plugin model (see the attached plugin in my newer post) to an internal API model. The plugin I attached represents a trivially-edited version of a plugin we've been testing in production for quite some time. It probably wouldn't be *that* difficult or risky, though, to take the very same code and put it into the lazy API of your choice. And I think it'd be great, even though we wouldn't actually benefit from the laziness ourselves ;) If you don't hate storing the resulting parsed mime in $txn-notes('mime_body'), though, that would rock, since then it wouldn't break our existing plugin-based forked code ;) -Jared
Re: Rewritten URIBL plugin
I think we should probably consider putting support for parsed messages into core, with the parsing done lazily if requested by the API. I forgot, we did kinda think of a couple of reasons not to want an API. depending on where you put it, you may find QP in general depending on MIME::Parser even if it never uses it. The benefit of the plugin is that /etc/qpsmtpd/plugins controls whether you wind up having to install, and take up your memory with, MIME::Parser You could conditionally 'require', although that does do different things to memory management that I don't understand Another thought, there are different reasons people might want mime to be parsed that benefit from different implementations. If you only want the text data, maybe you want to optimize for that somehow and throw away the binary attachment data, or vice versa (unfortunately there doesn't seem to be a way to do this with MIME::Parser that i've found though...). If you only want the text data *and* you don't mind having it not perfectly reconstructed (like the URIBL) plugin, then perhaps you could come up with a way to parse without MIME::Parser that is more efficient and more lazy (not in the sense you were suggesting earlier :P) -Jared
Re: Rewritten URIBL plugin
Jared Johnson wrote: I think we should probably consider putting support for parsed messages into core, with the parsing done lazily if requested by the API. I forgot, we did kinda think of a couple of reasons not to want an API. depending on where you put it, you may find QP in general depending on MIME::Parser even if it never uses it. The benefit of the plugin is that /etc/qpsmtpd/plugins controls whether you wind up having to install, and take up your memory with, MIME::Parser One thing we've discussed in the past (at least in my imagination) although not quite figured out how to implement, is making plugins act a little more like normal modules, so that one plugin can use another. So if you're interested in the parsed mime functionality, your plugin can plugin_use util/parsed-mime and the right magic happens. -R
Re: Rewritten URIBL plugin
- Introduces support for URIBL services that may not have worked right, at least out of the box, before. Defines the subtle differences between various known URIBL services in order to maximize compatibility. Is it worth pulling some of this config out of the code and putting it into some sort of config file? Probably would be worth it, and not that difficult to do. Any new config scheme should probably be both backward compatible and non-spaghetti. The current scheme looks like: multi.uribl.com 2,4 deny the new plugin does add one additional feature, you can now do: multi.uribl.com 2 deny multi.uribl.com 4 log but $zones defines a bunch of other things - labels for the zone in general and specific masks, check_ip, check_host, host_wildcards_supported, lookup_a_record, lookup_ns_record. Ignoring the label issue, all the other directives are related only to the zone; perhaps we could add a new scheme on top of the mask definition scheme (which we would continue to read in the old way), getting even further away from the one-line-per-zone style: multi.uribl.com action deny multi.uribl.com check_ip 1 multi.uribl.com check_host 1 multi.uribl.com lookup_a_record 0 multi.uribl.com lookup_ns_record 0 multi.uribl.com host_wildcards_support 0 If nobody hates this scheme it does seem like something I could add pretty easily send an updated copy - Uses Net::DNS::Async to simplify the code, and also to ensure the afore-mentioned A and NS lookups will prompt new URIBL lookups in an efficient and simple manner via callbacks Does the code still work with the async qpsmtpd cores? Well... I didn't change the existing plugin a whole lot; I did change from data_handler() registered via register(), to plain old hook_data_post(); this may need changed considering that the comments above each sub (which I also removed) suggested that hook_data_post() and register() were 'not used' for async. These things could easily be changed if needed. However we don't run async anywhere, not even in test environments, so I'm not very qualified to know if they need changed, or give any guarantees. And I'm confused by that bit. From what I can tell, the async/uribl plugin ignores plugins/uribl entirely and uses Qpsmtpd::Plugins::Async::DNSBLBase, which in turn uses ParaDNS. Now I noticed the async rhsbl plugin used Q::P::A::D and that this used ParaDNS before and looked at it briefly to see if I could use it for my purposes, but at the top of the docs it's mentioned that only A lookups are supported, so I skipped it and went to Net::DNS::Async. I don't really know the strongpoints of N::D::A vs. ParaDNS but it did note that the latter more expressly claims to be interested in performance and scalability. So in conclusion: (1) I think I'll look into using ParaDNS and maybe even Q::P::Async::DNSBLBase if appropriate, even though my plugin is not in async; it might still have advantages, and the interesting logic is in the callbacks and not the module use anyway (2) I don't know if it matters whether this particular plugin works with async, I guess it depends on whether the old one really ever did. At least Net::DNS::Async should not make it any more blocking than the old plugin's collect_results(). At any rate somebody who plays in async territory should probably comment before any action is taken on the new plugin :) Attached also is tld_lists.pl, a companion file that needs to be dropped in lib/Qpsmtpd/ which provides the list of first, second, and third level TLDs that we care about. It's derived from our URIBL datafeed as well as Do the owners of that data care about it being used this way? You may be violating any agreement with them. Would they be ok if this was released as an independent CPAN module?Either way, can we structure this as an API instead of just an include file? (1) I suspect that anyone who runs a URIBL service would be thrilled with this information being available for this particular use, because it minimizes unnecessary queries to their services. Consider the old method that sent 2 or 3 queries instead of one because it didn't know which one was right. The fact that SURBL and URIBL both publish the two and three level lists and encourage their use is evidence of this :) (2) First-level TLDs are public information, of course, and with the two-level and three-level lists being public, the interesting business is that the first-level list and the SURBL second and third level lists are pruned based on URIBL's data. So it comes down to whether anyone can be considered to have a claim on the privilege of knowing how to *exclude* results. I guess maybe they could if it came down to it... (3) I'm pretty sure I could get URIBL's explicit permission, but it would take a long time. The founder and sole owner of URIBL was also the original author of DoubleCheck, so I feel we would have a bit of an in :) However even so he always takes forever to get back to us in
Re: Rewritten URIBL plugin
Do the owners of that data care about it being used this way? You may be violating any agreement with them. Would they be ok if this was released as an independent CPAN module?Either way, can we structure this as an API instead of just an include file? Forgot to mention, yes, I'd be happy to add the few lines it would take to make this an API, even if it's just in lib/Qpsmtpd. It might be nice for us to get datafeeds with other services (we're looking to at least get one from DBL, possibly SURBL and imvURI) before publishing to CPAN, mainly so that we know for sure that we have not over-pruned the first level or SURBL second third level lists. As they are though, I consider the lists good enough to warrant production use of this method over the current method in the existing QP uribl plugin. After all any TLD that manages to make it onto other services but has garnered *zero* hits on URIBL, probably has very very few entries on those other services. Anyway, if it's a bit of a licensing/copyright problem to use the data from URIBL, it'll be more of one to use the data from other services as well, but we can probably work that question in from the get-go -Jared
Re: Rewritten URIBL plugin
On Sun, Jul 25, 2010 at 04:43:36AM -0500, Jared Johnson wrote: The plugin has the following advantages over the original: Based on a single read-through of the code, I like most of it. Some assorted observations, though: - Parsing out the MIME sections is worthwhile, though it's really something that ought to be done in the core libs when plugins indicate they need it, or in a utility plugin if that's practical. This was something I meant to bring up a while back -- we have quite a few different content-scanning plugins, and if they're all obliged to invoke MIME::Parser repeatedly it'll be even more costly than that module already is. Not that I have time to do it myself, though. :P - For messages below a few hundred kb or so, consider MIME::Parser-output_to_core(), so we can avoid the fs churn. Annoyingly, MIME::Parser doesn't provide a mechanism to only consider particular mimetypes, or we could skip decoding non-text bodies. - It looks like you're scanning inside all parts of multipart MIME, where doing so only on text parts might be preferable. - The config really needs to live in a config file -- changing or expanding the format is fine if needed, but config that lives in code can't really be edited. I'm not bothered by having per-list defaults in the plugin, but one should be able to fully add a new list or alter an existing one without changing Perl code. - In the unwinding of HTML entity escaping -- I suggest doing amp; last. amp;gt; is gt;, not . the SURBL two-level and three-level lists -- but the SURBL lists are pruned to exclude stuff that we're pretty sure is never going to actually get listed -- when was the last time you saw a spammer use .mil? uribl.com never has, according to our datafeed :) We'll probably see compromised .gov and .mil sites hosting malware, much as we see in other domains. They're not that useful for classical spammer purposes since they can't be registered in the usual fashion. Devin -- Devin \ aqua(at)devin.com, IRC:Requiem; http://www.devin.com Carraway \ 1024D/E9ABFCD2: 13E7 199E DD1E 65F0 8905 2E43 5395 CA0D E9AB FCD2
Re: Rewritten URIBL plugin
On Sun, Jul 25, 2010 at 04:43:36AM -0500, Jared Johnson wrote: The plugin has the following advantages over the original: Based on a single read-through of the code, I like most of it. Some assorted observations, though: - Parsing out the MIME sections is worthwhile, though it's really something that ought to be done in the core libs when plugins indicate they need it, or in a utility plugin if that's practical. This was something I meant to bring up a while back -- we have quite a few different content-scanning plugins, and if they're all obliged to invoke MIME::Parser repeatedly it'll be even more costly than that module already is. Not that I have time to do it myself, though. :P Our internal plugin actually just uses $txn-notes('mime_body') instead of creating a new $msg, because we do indeed have such a utility plugin. A bit short of being transparently handled by internals, but I could at least include code in the plugin to use $txn-notes('mime_body') instead of parsing when that exists. This would also allow our internal plugin to be slightly less forked from what winds up in upstream :) Your idea of doing this internally sounds cool to me, but we're unlikely to get around to implementing it ourselves any time soon given the system that already exists is working OK. And unfortunately our utility plugin is really just one aspect of a bigger plugin that wouldn't really be useful to QP, and we have no method of 'asking' for MIME parsing from said plugin since for our purposes it's *always* necessary to parse mime. - For messages below a few hundred kb or so, consider MIME::Parser-output_to_core(), so we can avoid the fs churn. Annoyingly, MIME::Parser doesn't provide a mechanism to only consider particular mimetypes, or we could skip decoding non-text bodies. The MIME::Parser docs are pretty insistent that using output_to_core is detrimental to performance in general, and should only be bothered with when there are actual disk constraints. I've never tested this theory for really small messages, but it's kept me away from trying it :) If it's not very difficult, perhaps it could be optional via a plugin argument, and later via the handy dandy general mime-parsing-needed-feature if it's ever written - It looks like you're scanning inside all parts of multipart MIME, where doing so only on text parts might be preferable. No, in scan_body() we do: return unless $part-effective_type ~~ [qw( text/plain text/html )]; I think that's what we want anyway... - The config really needs to live in a config file -- changing or expanding the format is fine if needed, but config that lives in code can't really be edited. I'm not bothered by having per-list defaults in the plugin, but one should be able to fully add a new list or alter an existing one without changing Perl code. I agree. If you don't object to the spec I proposed in my reply to Robert, I can get that changed soon. - In the unwinding of HTML entity escaping -- I suggest doing amp; last. amp;gt; is gt;, not . Good idea, thanks :) the SURBL two-level and three-level lists -- but the SURBL lists are pruned to exclude stuff that we're pretty sure is never going to actually get listed -- when was the last time you saw a spammer use .mil? uribl.com never has, according to our datafeed :) We'll probably see compromised .gov and .mil sites hosting malware, much as we see in other domains. They're not that useful for classical spammer purposes since they can't be registered in the usual fashion. Ideally, we should only bother querying services for URIs that have some chance of being listed on those services, regardless of their chance of actually being used in spam. Thus, if the datafeeds don't have any hosts with .mil, there's no point checking them. However, (1) we're only deriving our list from URIBL right now, and even as we expand we can never say that we derived our list from every service that someone might use; (2) even if I ran my tld_list generating script every day against every service and sent in an updated list as soon as something got listed on .mil, there would still be the lag time to get a release out with the updated list and then have people upgrade, during which time there would be some misses. I think that using 'pruned' lists is eventually going to cause some misses somewhere. For our own use, since we have control of update timelines, it's worth the risk, and we also don't have to worry about legal issues with using the data. If either of these is a big enough problem for QP in general, though, you could work around it by using an un-pruned tld_lists.pl. This would give %firstlevel_tlds 263 entries instead of 128; %surbl_secondlevel_tlds would have 3,801 instead of 2,426; and %surbl_thirdlevel_tlds would have 155 instead of 135. If you're worried about the legal issue but *not* the timeliness issue, you could still exclude a few
Re: Rewritten URIBL plugin
Jared Johnson wrote: - Introduces support for URIBL services that may not have worked right, at least out of the box, before. Defines the subtle differences between various known URIBL services in order to maximize compatibility. Is it worth pulling some of this config out of the code and putting it into some sort of config file? Probably would be worth it, and not that difficult to do. Any new config scheme should probably be both backward compatible and non-spaghetti. The current scheme looks like: multi.uribl.com 2,4 deny the new plugin does add one additional feature, you can now do: multi.uribl.com 2 deny multi.uribl.com 4 log but $zones defines a bunch of other things - labels for the zone in general and specific masks, check_ip, check_host, host_wildcards_supported, lookup_a_record, lookup_ns_record. Ignoring the label issue, all the other directives are related only to the zone; perhaps we could add a new scheme on top of the mask definition scheme (which we would continue to read in the old way), getting even further away from the one-line-per-zone style: multi.uribl.com action deny multi.uribl.com check_ip 1 multi.uribl.com check_host 1 multi.uribl.com lookup_a_record 0 multi.uribl.com lookup_ns_record 0 multi.uribl.com host_wildcards_support 0 If nobody hates this scheme it does seem like something I could add pretty easily send an updated copy I don't hate it, FWIW. :) Sticking with one line per zone is NOT a requirement. (Heck, I'd even be willing to go so far as saying we could go to a even more complicated config scheme. One of qpsmtpd's unwritten goals is to NOT pull in half of CPAN to run, but we could expand a little.) - Uses Net::DNS::Async to simplify the code, and also to ensure the afore-mentioned A and NS lookups will prompt new URIBL lookups in an efficient and simple manner via callbacks Does the code still work with the async qpsmtpd cores? Well... I didn't change the existing plugin a whole lot; I did change from data_handler() registered via register(), to plain old hook_data_post(); this may need changed considering that the comments above each sub (which I also removed) suggested that hook_data_post() and register() were 'not used' for async. These things could easily be changed if needed. However we don't run async anywhere, not even in test environments, so I'm not very qualified to know if they need changed, or give any guarantees. And I'm confused by that bit. From what I can tell, the async/uribl plugin ignores plugins/uribl entirely and uses Qpsmtpd::Plugins::Async::DNSBLBase, which in turn uses ParaDNS. In that case, it's possible (and likely) that it may never have worked with async. When time is short, it's easy to just ask questions :) Now I noticed the async rhsbl plugin used Q::P::A::D and that this used ParaDNS before and looked at it briefly to see if I could use it for my purposes, but at the top of the docs it's mentioned that only A lookups are supported, so I skipped it and went to Net::DNS::Async. I don't really know the strongpoints of N::D::A vs. ParaDNS but it did note that the latter more expressly claims to be interested in performance and scalability. So in conclusion: (1) I think I'll look into using ParaDNS and maybe even Q::P::Async::DNSBLBase if appropriate, even though my plugin is not in async; it might still have advantages, and the interesting logic is in the callbacks and not the module use anyway (2) I don't know if it matters whether this particular plugin works with async, I guess it depends on whether the old one really ever did. At least Net::DNS::Async should not make it any more blocking than the old plugin's collect_results(). At any rate somebody who plays in async territory should probably comment before any action is taken on the new plugin :) Attached also is tld_lists.pl, a companion file that needs to be dropped in lib/Qpsmtpd/ which provides the list of first, second, and third level TLDs that we care about. It's derived from our URIBL datafeed as well as Do the owners of that data care about it being used this way? You may be violating any agreement with them. Would they be ok if this was released as an independent CPAN module?Either way, can we structure this as an API instead of just an include file? (1) I suspect that anyone who runs a URIBL service would be thrilled with this information being available for this particular use, because it minimizes unnecessary queries to their services. Consider the old method that sent 2 or 3 queries instead of one because it didn't know which one was right. The fact that SURBL and URIBL both publish the two and three level lists and encourage their use is evidence of this :) (2) First-level TLDs are public information, of course, and with the two-level and three-level lists being public, the interesting business is that the
Re: Rewritten URIBL plugin
And I'm confused by that bit. From what I can tell, the async/uribl plugin ignores plugins/uribl entirely and uses Qpsmtpd::Plugins::Async::DNSBLBase, which in turn uses ParaDNS. In that case, it's possible (and likely) that it may never have worked with async. When time is short, it's easy to just ask questions :) OK, I looked closer and I guess the async plugin actually 'inherits' from the non-async one, so it uses its methods. That's a new topic for me but I think I can come up with the necessary changes to the non-async plugin and maybe a patch to the async one to use it correctly. But I won't be able to test the theory :) -Jared
Re: Rewritten URIBL plugin
Attached also is tld_lists.pl, a companion file that needs to be dropped in lib/Qpsmtpd/ which provides the list of first, second, and third level TLDs that we care about. It's derived from our URIBL datafeed as well as Would they be ok if this was released as an independent CPAN module?Either way, can we structure this as an API instead of just an include file? I've already pretty much done the work to access tld_lists as an API, but I'm now doubting the use unless it is released to CPAN: it really makes the most sense to have Qpsmtpd::TLDLists.pm itself still pull in the contents of tld_lists.pl as I posted them, in a 'require', because those contents are created by an automated script. It seems silly to require the modification of said automated script if we want it to generate different logic. But when Qpsmtpd::TLDLists require's tld_lists.pl, then comes the question 'if Qpsmtpd::TLDLIsts is the only thing that accesses this data, and plugins/uribl is the only thing that uses Qpsmtpd::TLDLists, then why doesn't plugins/uribl just read the data itself? You guys can have it whichever way you want it but I say, don't be afraid to treat a .pl file as if it were basically a static, shipped config file in a format easily read by perl plugins :) -Jared
Re: Rewritten URIBL plugin
- Introduces support for URIBL services that may not have worked right, at least out of the box, before. Defines the subtle differences between various known URIBL services in order to maximize compatibility. Is it worth pulling some of this config out of the code and putting it into some sort of config file? - Uses Net::DNS::Async to simplify the code, and also to ensure the afore-mentioned A and NS lookups will prompt new URIBL lookups in an efficient and simple manner via callbacks Does the code still work with the async qpsmtpd cores? Attached also is tld_lists.pl, a companion file that needs to be dropped in lib/Qpsmtpd/ which provides the list of first, second, and third level TLDs that we care about. It's derived from our URIBL datafeed as well as Do the owners of that data care about it being used this way? You may be violating any agreement with them. Would they be ok if this was released as an independent CPAN module?Either way, can we structure this as an API instead of just an include file? There are some disadvantages to this code. MIME::Parser is probably heavy in terms of resource usage compared to just scanning the message body line-by-line. This would probably not be that difficult to revert again if desired, although I think the advantages of M::P outweigh the disadvantages. It would be cool if it was configurable (i.e. use MIME::Parser or not.) code, etc. At any rate, I've worked hard on this code and am thankful to Devin and the rest of the QP contributors for the starting point, so I hope that someone manages to find this useful and perhaps massage it into upstream :) I don't have time to test or massage it, but the uribl plugin is quite important, so if someone is willing to test, benchmark, and massage, it would be cool to get a more powerful version into core. -R