Re: (Non-) Capturing REs

2011-10-25 Thread Adam Katz
On Mon, 2011-10-24 at 13:58 -0700, Adam Katz wrote:
>> Using special variables like those you mentioned is particularly
>> bad, [...] That's not to say that the extra memory consumption
>> from an unnecessary grouping doesn't impact performance.

On 10/24/2011 02:45 PM, Karsten Bräckelmann wrote:
> Well, does it? Measurably? Even if the RE does *not* match?

If the RE doesn't match, I doubt it.  Not sure though.

> If so, does it still have any measurable effect, if we're talking a 
> handful of custom rules, with stock rules using non-capturing grouping?
> (The objective here is a trade-off between optimized REs and not 
> confusing users who aren't intimately familiar with REs. They tend to
> get heavy to grasp rather quickly, and the extra ?: weird chars don't
> help that.)

Interesting point.  Maybe we shouldn't get into such detail with an
admin that just wants to add a few rules.

Also, there are better ways to optimize rules.  For example (assuming
capture groups don't consume memory if the RE doesn't match), start the
RE unambiguously: with something that is not a parenthetical group, not
a wildcard, not a character class, etc., and ideally with a rare
character.  /\bfoo (bar|vaz)/ is better than /(foo bar|foo vaz)/, while
perl's left-to-right nature makes the gain of
/\b(hello|goodbye) world\b/ over /(hello world|goodbye world)/ far less
notable (it only matters if lots of other things commonly follow
"hello").
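
The factoring described above can be sketched as follows. This is only an illustration in Python's re module, not Perl and not SpamAssassin's rule engine, and it measures nothing about performance. The sample strings and the helper name matched_text are made up for the example. It shows that the factored, non-capturing form keeps no capture group, and that the added \b is a small change in meaning rather than a pure optimization.

```python
import re

# Alternation wraps the whole pattern: no fixed leading text, one capture group.
unfactored = re.compile(r"(foo bar|foo vaz)")
# Shared prefix hoisted out, alternation kept in a non-capturing group.
factored = re.compile(r"\bfoo (?:bar|vaz)")

# Hypothetical sample inputs, chosen only to exercise both patterns.
samples = ["a foo bar b", "foo vaz", "foo baz", "xfoo bar", "nothing here"]


def matched_text(pattern, text):
    """Return the text matched by pattern in text, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Map each sample to (unfactored result, factored result).
results = {s: (matched_text(unfactored, s), matched_text(factored, s))
           for s in samples}

for sample, (u, f) in results.items():
    print(f"{sample!r}: unfactored={u!r} factored={f!r}")

print("capture groups:", unfactored.groups, "vs", factored.groups)
```

The two patterns agree on every sample except "xfoo bar", where the \b in the factored form rejects a match that starts in the middle of a word. Whether that difference is wanted depends on the rule.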

>>> Is it really worth it, religiously using non-capturing grouping?
>> 
>> From the profiling I've seen, yes it is.  (I don't have data to 
>> share though, sorry).
> 
> The profiled code, does it use the special match capturing variables
> *anywhere* in the entire program? The profiled and compared 
> versions, would that be like the equivalent of using capturing vs 
> non-capturing in all SA stock rules?

I'm not sure, though I seem to recall the SA debug output includes the
matched text (which implies $&), though if this were important, I'm sure
we'd have already concluded it worthwhile to do stupid things like
surrounding entire regexps with (?=this).

> Not trying to be confrontational, just honestly asking and wondering
> about the real impact. After all, the perlre docs specifically 
> mention to strongly prefer non-capturing grouping basically once
> only -- in the warning paragraph about the special vars.

The perl docs may have cut that for simplicity, just as you're
suggesting above ;-)

In reality, optimizing (including gaming of Perl's built-in
optimizations) is quite non-trivial.  Here's an excerpt from O'Reilly's
Mastering Regular Expressions (2nd Ed, page 253):

> Let me give a somewhat crazy example: you find  (000|999)$  in a Perl
> script, and decide to turn those capturing parentheses into 
> non-capturing parentheses. This should make things a bit faster, you
> think, since the overhead of capturing can now be eliminated. But 
> surprise, this small and seemingly beneficial change can slow this 
> regex down by /several orders of magnitude/ (thousands and thousands 
> of times slower). /What!?/ It turns out that a number of factors come
> together just right in this example to cause the 'end of string/line
> anchor optimization' (pg 246) to be turned off when non-capturing
> parentheses are used. I don't want to dissuade you from using 
> non-capturing parentheses with Perl--their use is beneficial in the 
> vast majority of cases--but in this particular case, it's a
> disaster.





Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?

2011-10-25 Thread John Hardin

On Tue, 25 Oct 2011, RW wrote:

> On Tue, 25 Oct 2011 06:28:41 -0700 (PDT)
> John Hardin wrote:
>
>> Seconded. MTAs typically have efficient facilities for white- or
>> black-listing specific email addresses. Use the capabilities of your
>> MTA and glue layer to completely bypass SA for those addresses since
>> you _know_ you want to receive mail from them.
>
> The downside to that is that it's not going through Bayes, so there's
> no auto-learning or atime updates. So when someone with a whitelisted
> address delegates, moves-on, or uses a different account, Bayes may be
> less well prepared than it would otherwise be. I suspect that in some
> cases MTA whitelisting may actually lead to a worse FP rate than doing
> nothing - particularly where BAYES_00 has been given a more substantial
> score.

Modulo manual training with classified & miss corpora, of course. I
distrust autolearn, but then I've never administered SA in a large user
environment.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  It is not the business of government to make men virtuous or
  religious, or to preserve the fool from the consequences of his own
  folly.  -- Henry George
---
 320 days since the first successful private orbital launch (SpaceX)


Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?

2011-10-25 Thread RW
On Tue, 25 Oct 2011 06:28:41 -0700 (PDT)
John Hardin wrote:

 
> Seconded. MTAs typically have efficient facilities for white- or 
> black-listing specific email addresses. Use the capabilities of your
> MTA and glue layer to completely bypass SA for those addresses since
> you _know_ you want to receive mail from them.


The downside to that is that it's not going through Bayes, so there's
no auto-learning or atime updates. So when someone with a whitelisted
address delegates, moves-on, or uses a different account, Bayes may be
less well prepared than it would otherwise be. I suspect that in some
cases MTA whitelisting may actually lead to a worse FP rate than doing
nothing - particularly where BAYES_00 has been given a more substantial
score.


Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?

2011-10-25 Thread Benny Pedersen

On Tue, 25 Oct 2011 11:21:07 +0200, Robert Schetterer wrote:

> You should choose another way of whitelisting,
> e.g. bypass SpamAssassin for trusted server IPs etc.
> Anyway, why not use e.g. whitelist_from *@somebody.co ?

This opens the door to forged mail where the sender equals the
recipient (something I have never seen in my logs). So if your MTA is
not checking sender authentication, don't use whitelist_from; it is
safe to use whitelist_auth.


Re: Where is plugin directory on a personal install?

2011-10-25 Thread Matus UHLAR - fantomas

On 25.10.11 09:27, Pat Traynor wrote:

> Upgrading the ancient spamassassin on my server is looking to be a scary
> proposition, so I did my own personal install.  It's working fine, but a
> lot of spam is still coming through, and I'd like to do some tweaking.
>
> Where is the plugin directory if you do your own personal install?
>
> I can't find a "Plugin" directory anywhere in my home directory, aside
> from the one in the folder where I initially extracted Spamassassin to
> do the make and install.  It doesn't use *that*, does it?

Plugins are usually installed in the perl5 directory, e.g.
/usr/share/perl5/Mail/SpamAssassin/Plugin/DCC.pm for
Mail::SpamAssassin::Plugin::DCC


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Linux - It's now safe to turn on your computer.
Linux - Teraz mozete pocitac bez obav zapnut.


Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?

2011-10-25 Thread John Hardin

On Tue, 25 Oct 2011, Robert Schetterer wrote:

> On 25.10.2011 09:51, SuperDuper wrote:
>
>> I am planning on exporting a list of our client's email addresses into a file
>> with 5000 separate lines as such:
>> whitelist_from cli...@somebody.co
>
> You should choose another way of whitelisting,
> e.g. bypass SpamAssassin for trusted server IPs etc.

Seconded. MTAs typically have efficient facilities for white- or
black-listing specific email addresses. Use the capabilities of your MTA
and glue layer to completely bypass SA for those addresses since you
_know_ you want to receive mail from them.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  False is the idea of utility that sacrifices a thousand real
  advantages for one imaginary or trifling inconvenience; that would
  take fire from men because it burns, and water because one may drown
  in it; that has no remedy for evils except destruction. The laws
  that forbid the carrying of arms are laws of such a nature. They
  disarm only those who are neither inclined nor determined to commit
  crime.   -- Cesare Beccaria, quoted by Thomas Jefferson
---
 320 days since the first successful private orbital launch (SpaceX)


Where is plugin directory on a personal install?

2011-10-25 Thread Pat Traynor

Upgrading the ancient spamassassin on my server is looking to be a scary
proposition, so I did my own personal install.  It's working fine, but a
lot of spam is still coming through, and I'd like to do some tweaking.

Where is the plugin directory if you do your own personal install?

I can't find a "Plugin" directory anywhere in my home directory, aside
from the one in the folder where I initially extracted Spamassassin to
do the make and install.  It doesn't use *that*, does it?

--pat--
--
Pat Traynor
p...@ssih.com


Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?

2011-10-25 Thread Martin Gregorie
On Tue, 2011-10-25 at 00:51 -0700, SuperDuper wrote:
> I am planning on exporting a list of our client's email addresses into a file
> with 5000 separate lines as such:
> whitelist_from cli...@somebody.co
> 
I do essentially the same thing with an SA plugin and rule plus a
database. 

Background: I archive all incoming and outgoing mail in a PostgreSQL
database because it keeps my mail folders nice and empty while making
access to archived mail somewhat faster than searching through mail
folders is. The archive schema includes a view that contains only the
addresses of people I've sent mail to. The plugin does lookups on this
view and has an associated rule that whitelists hits by applying a
suitably large negative score. The benefit of handling whitelisting this
way is that updating is completely automatic and doesn't require SA to
be stopped and restarted each time the list changes: every time I write
or reply to a new correspondent they appear in the view.

Suggestion: there is nothing to stop the plugin from doing its lookups
against a table provided that it contains at least the same column as
the view and you have a way of keeping the table's contents up to date.
The view looks like this:

create view whitelist as
select  distinct email
from    address a, addresstype t
where   a.archive='yes' and
        a.self = 'no' and
        a.sdbk=t.asdbk and
        t.type='To';

So a table like the following should be fine and is probably general
enough for it to be used without modification by any RDBMS. Of course it
can have other columns that help to maintain the table and/or make it
useful for other related tasks, e.g. a client list:

create table whitelist
(
    email varchar(80) primary key
);
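
For anyone who wants to try the table-based lookup before touching a plugin, here is a minimal sketch. It is not the plugin's code: it uses Python with an in-memory SQLite database purely so the example is self-contained, whereas the archive described above lives in PostgreSQL. The function names open_whitelist and is_whitelisted, the example addresses, and the lower-casing of addresses are all assumptions made for illustration.

```python
import sqlite3


def open_whitelist(addresses):
    """Create an in-memory whitelist table and load the given addresses."""
    conn = sqlite3.connect(":memory:")
    # Same single-column shape as the table definition above.
    conn.execute("create table whitelist (email varchar(80) primary key)")
    # Assumption: addresses are stored lower-cased; "or ignore" (SQLite
    # syntax) skips duplicates that would violate the primary key.
    conn.executemany(
        "insert or ignore into whitelist (email) values (?)",
        [(a.strip().lower(),) for a in addresses],
    )
    conn.commit()
    return conn


def is_whitelisted(conn, sender):
    """Return True if the sender address has a row in the whitelist table."""
    row = conn.execute(
        "select 1 from whitelist where email = ?",
        (sender.strip().lower(),),
    ).fetchone()
    return row is not None


# Hypothetical data for demonstration only.
conn = open_whitelist(["alice@example.com", "bob@example.org"])
print(is_whitelisted(conn, "Alice@Example.com"))
print(is_whitelisted(conn, "mallory@example.net"))
```

Because the lookup is a single primary-key query, adding or removing a row takes effect on the next lookup without restarting anything, which is the property described above.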

If this sounds useful to you, the plugin is available here:
http://www.libelle-systems.com/downloads/ma/docs/manual/whitelisting.html

I should probably package the plugin with a table definition and make it
available for freestanding use but that hasn't happened yet: maybe I
should make that my next mini-project.


Martin





Re: 5000 x whitelist_from or whitelist_auth entries - performance hit?

2011-10-25 Thread Robert Schetterer
On 25.10.2011 09:51, SuperDuper wrote:
> 
> I am planning on exporting a list of our client's email addresses into a file
> with 5000 separate lines as such:
> whitelist_from cli...@somebody.co
> 
> 
> I'm running an Apple XServe with Intel Xeon Quadcores and 6Gb RAM -
> processor fairly underutilised at the moment.  Is 5000 whitelist entries
> expected to have a dramatic performance influence?
> 
> Also, further to this, will replacing the whitelist_from with whitelist_auth
> make a dramatic difference?
> 
> Approximately what percentage of servers out there are configured correctly
> so that whitelist_auth works correctly?
> 
> 
You should choose another way of whitelisting,
e.g. bypass SpamAssassin for trusted server IPs etc.
Anyway, why not use e.g. whitelist_from *@somebody.co ?

-- 
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria


5000 x whitelist_from or whitelist_auth entries - performance hit?

2011-10-25 Thread SuperDuper

I am planning on exporting a list of our client's email addresses into a file
with 5000 separate lines as such:
whitelist_from cli...@somebody.co
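
The export step could look roughly like the sketch below. It is written in Python only for illustration. The function name whitelist_lines and the sample addresses are hypothetical, and trimming, lower-casing and de-duplicating the addresses are assumptions about what a clean export would need. The directive is a parameter so the same list can be emitted as whitelist_auth lines instead.

```python
def whitelist_lines(addresses, directive="whitelist_from"):
    """Turn a list of addresses into one config line per unique address."""
    seen = set()
    lines = []
    for raw in addresses:
        # Assumption: surrounding whitespace and letter case are not significant.
        addr = raw.strip().lower()
        if not addr or addr in seen:
            continue  # skip blanks and duplicates
        seen.add(addr)
        lines.append(f"{directive} {addr}")
    return lines


# Hypothetical input standing in for the exported client list.
example = whitelist_lines(
    ["Client1@Example.com", "client2@example.com", "client1@example.com ", ""]
)
print("\n".join(example))
```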


I'm running an Apple XServe with Intel Xeon Quadcores and 6Gb RAM -
processor fairly underutilised at the moment.  Is 5000 whitelist entries
expected to have a dramatic performance influence?

Also, further to this, will replacing the whitelist_from with whitelist_auth
make a dramatic difference?

Approximately what percentage of servers out there are configured correctly
so that whitelist_auth works correctly?

