Re: (Non-) Capturing REs
On Mon, 2011-10-24 at 13:58 -0700, Adam Katz wrote:
>> Using special variables like those you mentioned are particularly
>> bad, [...] That's not to say that the extra memory consumption
>> from an unnecessary grouping doesn't impact performance.

On 10/24/2011 02:45 PM, Karsten Bräckelmann wrote:
> Well, does it? Measurably? Even if the RE does *not* match?

If the RE doesn't match, I doubt it. Not sure though.

> If so, does it still have any measurable effect, if we're talking a
> handful custom rules, with stock rules using non-capturing grouping?
> (The objective here is a trade-off between optimized REs and not
> confusing users who aren't intimately familiar with REs. They tend to
> get heavy to grasp rather quickly, and the extra ?: weird chars don't
> help that.)

Interesting point. Maybe we shouldn't get into such detail with an
admin that just wants to add a few rules.

Also, there are better ways to optimize rules; e.g. assuming matchers
don't consume memory if the RE doesn't match, starting the RE
unambiguously -- non-parenthetical, non-globbed, non-character-class,
etc, ideally starting with a rare character; /\bfoo (bar|vaz)/ is
better than /(foo bar|foo vaz)/ while perl's left-to-right nature makes
the gain on /\b(hello|goodbye) world\b/ over
/(hello world|goodbye world)/ far less notable (it's only notable if
lots of other things commonly follow "hello").

>>> Is it really worth it, religiously using non-capturing grouping?
>>
>> From the profiling I've seen, yes it is. (I don't have data to
>> share though, sorry).
>
> The profiled code, does it use the special match capturing variables
> *anywhere* in the entire program? The profiled and compared
> versions, would that be like the equivalent of using capturing vs
> non-capturing in all SA stock rules?

I'm not sure, though I seem to recall the SA debug output includes the
matched text (which implies $&), though if this were important, I'm
sure we'd have already concluded it worthwhile to do stupid things like
surrounding entire regexps with (?=this).

> Not trying to be confrontational, just honestly asking and wondering
> about the real impact. After all, the perlre docs specifically
> mention to strongly prefer non-capturing grouping basically once
> only -- in the warning paragraph about the special vars.

The perl docs may have cut that for simplicity, just as you're
suggesting above ;-)

In reality, optimizing (including gaming of Perl's built-in
optimizations) is quite non-trivial. Here's an excerpt from O'Reilly's
Mastering Regular Expressions (2nd Ed, page 253):

> Let me give a somewhat crazy example: you find (000|999)$ in a Perl
> script, and decide to turn those capturing parentheses into
> non-capturing parentheses. This should make things a bit faster, you
> think, since the overhead of capturing can now be eliminated. But
> surprise, this small and seemingly beneficial change can slow this
> regex down by /several orders of magnitude/ (thousands and thousands
> of times slower). /What!?/ It turns out that a number of factors come
> together just right in this example to cause the end of 'string/line
> anchor optimization' (pg 246) to be turned off when non-capturing
> parentheses are used. I don't want to dissuade you from using
> non-capturing parentheses with Perl--their use is beneficial in the
> vast majority of cases--but in this particular case, it's a
> disaster.
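For readers who want to see the capturing vs. non-capturing difference discussed above, here is a minimal sketch. It uses Python's `re` module rather than Perl, so it only illustrates the semantics (what ends up in the match groups) and offers a way to time both forms yourself; Python's engine and its optimizations differ from Perl's, so timings measured here say nothing reliable about SpamAssassin rules. The helper names `describe` and `time_pattern` are made up for this illustration.

```python
import re
import timeit

# The same rule written with a capturing and a non-capturing group.
CAPTURING = re.compile(r"\bfoo (bar|vaz)")
NON_CAPTURING = re.compile(r"\bfoo (?:bar|vaz)")


def describe(pattern, text):
    """Return (whole match, captured groups), or None if there is no match."""
    match = pattern.search(text)
    if match is None:
        return None
    return (match.group(0), match.groups())


def time_pattern(pattern, text, repeat=10000):
    """Time `repeat` searches of `pattern` over `text`, in seconds."""
    return timeit.timeit(lambda: pattern.search(text), number=repeat)


if __name__ == "__main__":
    sample = "say foo vaz now"
    print(describe(CAPTURING, sample))
    print(describe(NON_CAPTURING, sample))
    # Compare on a non-matching input too, since that is the case the
    # thread is unsure about. Results depend on the engine and machine.
    miss = "nothing relevant here " * 100
    print(time_pattern(CAPTURING, miss), time_pattern(NON_CAPTURING, miss))
```

Both patterns match the same text; only the capturing form stores the alternative that matched. Whether that bookkeeping is measurable has to be profiled on the actual engine and rule set, as the thread itself concludes.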
Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?
On Tue, 25 Oct 2011, RW wrote:

> On Tue, 25 Oct 2011 06:28:41 -0700 (PDT) John Hardin wrote:
>
>> Seconded. MTAs typically have efficient facilities for white- or
>> black-listing specific email addresses. Use the capabilities of your
>> MTA and glue layer to completely bypass SA for those addresses since
>> you _know_ you want to receive mail from them.
>
> The downside to that is that it's not going through Bayes, so there's
> no auto-learning or atime updates. So when someone with a whitelisted
> address delegates, moves-on, or uses a different account, Bayes may
> be less well prepared than it would otherwise be.
>
> I suspect that in some cases MTA whitelisting may actually lead to a
> worse FP rate than doing nothing - particularly where BAYES_00 has
> been given a more substantial score.

Modulo manual training with classified & miss corpora, of course. I
distrust autolearn, but then I've never administered SA in a large
user environment.

--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
 It is not the business of government to make men virtuous or
 religious, or to preserve the fool from the consequences of his own
 folly. -- Henry George
---
 320 days since the first successful private orbital launch (SpaceX)
Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?
On Tue, 25 Oct 2011 06:28:41 -0700 (PDT) John Hardin wrote:
> Seconded. MTAs typically have efficient facilities for white- or
> black-listing specific email addresses. Use the capabilities of your
> MTA and glue layer to completely bypass SA for those addresses since
> you _know_ you want to receive mail from them.

The downside to that is that it's not going through Bayes, so there's
no auto-learning or atime updates. So when someone with a whitelisted
address delegates, moves-on, or uses a different account, Bayes may be
less well prepared than it would otherwise be.

I suspect that in some cases MTA whitelisting may actually lead to a
worse FP rate than doing nothing - particularly where BAYES_00 has
been given a more substantial score.
Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?
On Tue, 25 Oct 2011 11:21:07 +0200, Robert Schetterer wrote:
> you should choose another way for whitelisting, i.e bypass
> spamassassin for trusted server ips etc
>
> anyway why not using i.e. whitelist_from *@somebody.co ?

This opens the door to forgeries in the many cases where the sender
equals the recipient (never seen in my logs). So if the MTA is not
checking sender authentication, don't use whitelist_from; it's safe to
use whitelist_auth.
Re: Where is plugin directory on a personal install?
On 25.10.11 09:27, Pat Traynor wrote:
> Upgrading the ancient spamassassin on my server is looking to be a
> scary proposition, so I did my own personal install. It's working
> fine, but a lot of spam is still coming through, and I'd like to do
> some tweaking.
>
> Where is the plugin directory if you do your own personal install? I
> can't find a "Plugin" directory anywhere in my home directory, aside
> from the one in the folder where I initially extracted Spamassassin
> to do the make and install. It doesn't use *that*, does it?

Plugins are usually installed in the perl5 directory, e.g.
/usr/share/perl5/Mail/SpamAssassin/Plugin/DCC.pm
for Mail::SpamAssassin::Plugin::DCC

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Linux - It's now safe to turn on your computer.
Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?
On Tue, 25 Oct 2011, Robert Schetterer wrote:

> On 25.10.2011 09:51, SuperDuper wrote:
>>
>> I am planning on exporting a list of our client's email addresses
>> into a file with 5000 separate lines as such:
>> whitelist_from cli...@somebody.co
>
> you should choose another way for whitelisting, i.e bypass
> spamassassin for trusted server ips etc

Seconded. MTAs typically have efficient facilities for white- or
black-listing specific email addresses. Use the capabilities of your
MTA and glue layer to completely bypass SA for those addresses since
you _know_ you want to receive mail from them.

--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
 False is the idea of utility that sacrifices a thousand real
 advantages for one imaginary or trifling inconvenience; that would
 take fire from men because it burns, and water because one may drown
 in it; that has no remedy for evils except destruction. The laws that
 forbid the carrying of arms are laws of such a nature. They disarm
 only those who are neither inclined nor determined to commit crime.
 -- Cesare Beccaria, quoted by Thomas Jefferson
---
 320 days since the first successful private orbital launch (SpaceX)
Where is plugin directory on a personal install?
Upgrading the ancient spamassassin on my server is looking to be a
scary proposition, so I did my own personal install. It's working
fine, but a lot of spam is still coming through, and I'd like to do
some tweaking.

Where is the plugin directory if you do your own personal install? I
can't find a "Plugin" directory anywhere in my home directory, aside
from the one in the folder where I initially extracted Spamassassin to
do the make and install. It doesn't use *that*, does it?

--pat--

--
Pat Traynor
p...@ssih.com
Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?
On Tue, 2011-10-25 at 00:51 -0700, SuperDuper wrote:
> I am planning on exporting a list of our client's email addresses
> into a file with 5000 separate lines as such:
> whitelist_from cli...@somebody.co
>
I do essentially the same thing with an SA plugin and rule plus a
database.

Background: I archive all incoming and outgoing mail in a PostgreSQL
database because it keeps my mail folders nice and empty while making
access to archived mail somewhat faster than searching through mail
folders is. The archive schema includes a view that contains only the
addresses of people I've sent mail to. The plugin does lookups on this
view and has an associated rule that whitelists hits by applying a
suitably large negative score.

The benefit of handling whitelisting this way is that updating is
completely automatic and doesn't require SA to be stopped and
restarted each time the list changes: every time I write or reply to a
new correspondent they appear in the view.

Suggestion: there is nothing to stop the plugin from doing its lookups
against a table provided that it contains at least the same column as
the view and you have a way of keeping the table's contents up to
date. The view looks like this:

    create view whitelist as
      select distinct email
      from address a, addresstype t
      where a.archive='yes'
        and a.self = 'no'
        and a.sdbk=t.asdbk
        and t.type='To';

So a table like the following should be fine and is probably general
enough for it to be used without modification by any RDBMS. Of course
it can have other columns that help to maintain the table and/or make
it useful for other related tasks, e.g. a client list:

    create table whitelist
    (
      email varchar(80) primary key
    );

If this sounds useful to you, the plugin is available here:

http://www.libelle-systems.com/downloads/ma/docs/manual/whitelisting.html

I should probably package the plugin with a table definition and make
it available for freestanding use but that hasn't happened yet: maybe
I should make that my next mini-project.

Martin
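
As an illustration of the table-based lookup suggested above, here is a minimal sketch. It is not the plugin from the link (that is an SA plugin talking to PostgreSQL); it uses Python's built-in `sqlite3` with an in-memory database as a stand-in, and the function names `open_demo_db`, `add_address` and `is_whitelisted` are invented for this example. Lower-casing addresses before storing and comparing them is also an assumption of the sketch, not something the message specifies.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database holding the single-column whitelist table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table whitelist (email varchar(80) primary key)")
    return conn


def add_address(conn, email):
    """Add one address; duplicates are ignored thanks to the primary key."""
    conn.execute(
        "insert or ignore into whitelist (email) values (?)",
        (email.strip().lower(),),
    )
    conn.commit()


def is_whitelisted(conn, email):
    """Return True if the address is present in the whitelist table."""
    row = conn.execute(
        "select 1 from whitelist where email = ?",
        (email.strip().lower(),),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    db = open_demo_db()
    add_address(db, "client@example.com")
    print(is_whitelisted(db, "client@example.com"))
    print(is_whitelisted(db, "stranger@example.org"))
```

Because the lookup is a primary-key query, adding or removing a correspondent is an ordinary insert or delete and needs no restart of the filter, which is the benefit described in the message.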
Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?
On 25.10.2011 09:51, SuperDuper wrote:
>
> I am planning on exporting a list of our client's email addresses
> into a file with 5000 separate lines as such:
> whitelist_from cli...@somebody.co
>
>
> I'm running an Apple XServe with Intel Xeon Quadcores and 6Gb RAM -
> processor fairly underutilised at the moment. Are 5000 whitelist
> entries expected to have a dramatic performance influence?
>
> Also, further to this, will replacing the whitelist_from with
> whitelist_auth make a dramatic difference?
>
> Approximately what percentage of servers out there are configured
> correctly so that whitelist_auth works correctly?
>
>

you should choose another way for whitelisting, e.g. bypass
spamassassin for trusted server ips etc

anyway why not use e.g. whitelist_from *@somebody.co ?

--
Best Regards
Robert Schetterer
Germany/Munich/Bavaria
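
To make the per-address versus per-domain choice above concrete, here is a small sketch that turns an exported address list into SpamAssassin config lines. `whitelist_from` and `whitelist_auth` are the directives named in the thread; everything else (the function name `whitelist_lines`, the example.com addresses, lower-casing the input) is an assumption made for illustration.

```python
def whitelist_lines(addresses, directive="whitelist_auth", per_domain=False):
    """Build one config line per unique address or per unique domain.

    With per_domain=True each address is collapsed to a "*@domain"
    wildcard, which shrinks the list but whitelists every sender that
    claims that domain.
    """
    if directive not in ("whitelist_from", "whitelist_auth"):
        raise ValueError("unsupported directive: " + directive)
    entries = set()
    for address in addresses:
        address = address.strip().lower()
        if "@" not in address:
            continue  # skip lines that are clearly not addresses
        if per_domain:
            address = "*@" + address.rsplit("@", 1)[1]
        entries.add(address)
    return [directive + " " + entry for entry in sorted(entries)]


if __name__ == "__main__":
    exported = ["alice@example.com", "bob@example.com", "carol@example.org"]
    print("\n".join(whitelist_lines(exported)))
    print("\n".join(whitelist_lines(exported, "whitelist_from", per_domain=True)))
```

The wildcard form trades list size for exposure to forged senders, which is why the replies in this thread steer towards whitelist_auth when the MTA does not verify the sender.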
5000 x whitelist_from or whitelist_auth entries - performance hit?
I am planning on exporting a list of our client's email addresses into
a file with 5000 separate lines as such:
whitelist_from cli...@somebody.co

I'm running an Apple XServe with Intel Xeon Quadcores and 6Gb RAM -
processor fairly underutilised at the moment. Are 5000 whitelist
entries expected to have a dramatic performance influence?

Also, further to this, will replacing the whitelist_from with
whitelist_auth make a dramatic difference?

Approximately what percentage of servers out there are configured
correctly so that whitelist_auth works correctly?

--
View this message in context: http://old.nabble.com/5000-x-whitelist_from--or--whitelist_auth-entries---performance-hit--tp32715552p32715552.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.