Perl amateur here, be gentle, trying to help... ( Sending to dev list  
for hopefully more answers )

$./rebuildspamdb.pl
LWP::Simple 1.41 installed - download griplist available

******************************************************
RebuildSpamDB 2.6.1.1 (1.0.00) started - Thu Dec 24 12:19:13 2009

I was playing with the v1 rebuildspamdb.pl locally, and took a few  
thousand of my IMAP mails from the server and copied them locally.  I  
then duplicated them to spam and notspam to have a set of around  
50,000 messages each.

One thing I notice, the subroutine that does:
     sub processfolder {
        &printlog("\nProcessing...");
     }

When there are a lot of files, it takes a long time for the line
"Processing..." to be printed, so there is a lot of work going on.  I
can add my own debugging print statements before and after it, but they
are not printed until the subroutine is entirely done.  Any suggestions
on which line numbers to use?
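
The delayed output described above looks more like output buffering than the sub holding on to the CPU. A minimal sketch, assuming the debug lines go to STDOUT and that STDOUT is redirected to a file or pipe; debug_line is a hypothetical helper, not something in rebuildspamdb.pl:

```perl
use strict;
use warnings;
use IO::Handle;

# Turn off buffering on STDOUT so each print is written immediately.
# This has the same effect as setting $| = 1 while STDOUT is selected.
STDOUT->autoflush(1);

# Hypothetical helper: format one debug line with a timestamp.
sub debug_line {
    my ($msg) = @_;
    return sprintf( "[%s] %s\n", scalar localtime, $msg );
}

print debug_line("entering processfolder");
```

STDERR is unbuffered by default, so print STDERR or warn should also show up right away without any change.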

The subroutine seems to be recursing through all mails in each of a
few known directories, all of which it reports on, and it has never
once failed for me.  However, I want it to yield CPU time within the
routine in order to print debug data of my own as it is iterating.

A few data metrics I desire:
Name of file being worked on, with path
Size of file being worked on, bytes
First line of file - looking for nulls and stray ff's
Last line of file - looking for nulls and stray ff's
Separate log: name of file \n contents of file, headers, then body if  
possible
* basically, just general access to the file, but in real time, not  
shoved out in one lump at the end, I want to see it as it is run.

Where would I work to add these in order to have these print lines put  
in?
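
The metrics listed above can be sketched as a standalone script, to try on a folder before touching rebuildspamdb.pl itself. describe_file and show_bytes are hypothetical names; where such calls would go inside processfolder is not something I can tell from the code quoted here:

```perl
use strict;
use warnings;

# Hypothetical helper: gather basic facts about one message file.
# Returns a hash ref with path, size in bytes, and first and last line.
sub describe_file {
    my ($path) = @_;
    my $size = ( -s $path ) || 0;

    open( my $fh, '<', $path ) or return { path => $path, size => $size, error => "$!" };
    binmode $fh;    # raw bytes, so nulls and 0xFF survive untouched
    my ( $first, $last );
    while ( my $line = <$fh> ) {
        $first = $line unless defined $first;
        $last  = $line;
    }
    close $fh;

    return { path => $path, size => $size, first => $first, last => $last };
}

# Make non-printable bytes visible, e.g. a null becomes \x00.
sub show_bytes {
    my ($s) = @_;
    return '' unless defined $s;
    $s =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

# Usage: perl thisscript.pl /path/to/spam/*.eml
$| = 1;    # flush each line as it is printed
for my $file (@ARGV) {
    my $info = describe_file($file);
    printf "%s (%d bytes)\n  first: %s\n  last:  %s\n",
        $info->{path}, $info->{size},
        show_bytes( $info->{first} ), show_bytes( $info->{last} );
}
```

Reading in binary mode is a deliberate choice here, since the point is to spot stray nulls and ff bytes rather than to decode the text.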

Here is the source as I was working on it, unaltered, so the line
numbers should be accurate.

A few other things I saw as I was reviewing...

Lines 167 - 175 have a hard line break in the code.
             elsif ( $norm < 1.1   ) { $normdesc = '(very good -
     balanced)'; }

     heavy)'; }
             &printlog( "Corpus norm:\t%.4f %s $normdesc \n", $norm  );

Does this matter to perl in any significant way? ( trailing and open
close braces "}" ) I would get a parse error in any other language
that I am familiar with.  Don't all these \n's and spaces end up in
the log, making log lines no longer single lines, but multi-line?  Would
some type of string trimming solve this?
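
As far as I can tell, Perl accepts a line break inside a quoted string, and the break plus the indentation become part of the string. A minimal sketch of that, using the label quoted above; one_line is a hypothetical helper, and whether the break is really in the file or was added by my mail client is something I have not confirmed:

```perl
use strict;
use warnings;

# A quoted string may span lines; the newline and the indentation
# simply become part of the string.
my $normdesc = '(very good -
     balanced)';

# Collapse any run of whitespace (including the newline) to one space.
sub one_line {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return $text;
}

print one_line($normdesc), "\n";    # prints: (very good - balanced)
```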

Lines 154 to 181, I can not tell which conditions begin and end  
where.  Are lines 167 - 175 part of an else, or do they stand alone?

I did some cleaning up: http://pastie.org/755914
Lines 7 - 11, is that an "else" just a little malformed, and perl is  
OK with letting that multi-line bracing happen?  To me, it looks like  
lines 7 - 9 are just a single if with a single terminating brace.  If  
that is the case, then how does that hanging else not cause an error?

As I mentioned, total perl beginner here, but looking at the
if/elsif/else docs, I can not see why a terminator is needed in the
middle of the condition in an elsif, unless that is perl's "break;"
     elsif ( $norm < 0.9   ) {
        $normdesc = '(ok - slighly ham heavy)'; } <- What is this for?
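
From what the Perl docs say, every branch of an if/elsif/else chain needs its own braces, even for a single statement, so that "}" just closes the branch and is not a break. A simplified sketch using only the two thresholds quoted in this mail; the final else and its label are placeholders of mine:

```perl
use strict;
use warnings;

sub describe_norm {
    my ($norm) = @_;
    my $normdesc;
    if ( $norm < 0.9 ) {
        $normdesc = '(ok - slightly ham heavy)';
    }    # this brace closes the "if" block; no break is needed
    elsif ( $norm < 1.1 ) {
        $normdesc = '(very good - balanced)';
    }
    else {
        $normdesc = '(other)';    # hypothetical placeholder
    }
    return $normdesc;
}

print describe_norm(1.0), "\n";    # prints: (very good - balanced)
```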

I made a version of that chunk in php; perhaps this could be more
flexible than a big chunk of conditions, and allow more granular
control? I do not know if I will get to a full understanding of perl, so
this may not be of any value, other than that I believe an array of $norm
thresholds and their "values" is easier in the long run, as you never
edit a condition, just add and remove array name/value pairs. *php floats
are silly
http://pastie.org/756091
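
The same array idea can be sketched in Perl. This is only an illustration, not the code from the pastie: just the 0.9 and 1.1 entries come from the lines quoted in this mail, and the table name, sub name, and fallback label are made up:

```perl
use strict;
use warnings;

# Each entry: [ upper bound, description ]. Checked in order, first match wins.
# Only the 0.9 and 1.1 entries come from the quoted code.
my @norm_table = (
    [ 0.9, '(ok - slightly ham heavy)' ],
    [ 1.1, '(very good - balanced)' ],
);

sub norm_description {
    my ($norm) = @_;
    for my $entry (@norm_table) {
        my ( $limit, $desc ) = @$entry;
        return $desc if $norm < $limit;
    }
    return '(no description)';    # hypothetical fallback
}

printf "Corpus norm:\t%.4f %s\n", 1.0, norm_description(1.0);
```

Adding a new range then means adding one line to the table, in ascending order of the bound.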

I then took the file and re-indented and cleaned the braces so I could  
begin to understand it:
http://pastie.org/756094
It does appear that lines 21 - 36 in that pastie are one conditional
block.

Does perl not have a way to enforce some form of code consistency?
Original File, line 163 -164
     }
     else {               #norm exists, print it

Altered file, same area:
     } else {   #norm exists, print it

And finally, original file, line 185 - 187
     if   ( time - $starttime != 0 ) { $processTime = time - $starttime; }
     else                            { $processTime = 1; }
     &printlog( "\nTotal processing time: %d second(s)\n\n", $processTime );

I see three different "cuddle" brace styles/assignments, and am not  
sure if this is style, or methodologies in perl.  Can someone who  
knows more about perl elaborate with these cases?

Lines 190 - 193
     if ( !$noGriplist || !$asspLog ) { &uploadgriplist(); }
     &downloadgriplist();
     &downloaddroplist();

Is that... if $noGriplist is false OR $asspLog is false, then toss
uploadgriplist() into the background and also run downloadgriplist()
and downloaddroplist() in the background, or do the last two griplist
commands happen regardless of the nearby condition?

     if (condition) {
       &uploadgriplist();
       &downloadgriplist();
       &downloaddroplist();
     }

  -- or --

     if (condition) {
       &uploadgriplist();
     }

     &downloadgriplist(); &downloaddroplist();
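
Going by Perl's block rules, the second reading should be the right one: only the call inside the braces is conditional, and nothing is pushed into the background, since the "&" only calls the sub and each call finishes before the next starts. A small sketch with stub subs standing in for the real ones, which I have not reproduced; run_lists is a made-up wrapper:

```perl
use strict;
use warnings;

my @calls;    # records which stubs ran, in order

# Stubs standing in for the real subs, which are not reproduced here.
sub uploadgriplist   { push @calls, 'upload' }
sub downloadgriplist { push @calls, 'download grip' }
sub downloaddroplist { push @calls, 'download drop' }

sub run_lists {
    my ( $noGriplist, $asspLog ) = @_;
    @calls = ();
    if ( !$noGriplist || !$asspLog ) { uploadgriplist(); }
    downloadgriplist();    # outside the braces: always runs
    downloaddroplist();    # outside the braces: always runs
    return join ', ', @calls;
}

print run_lists( 1, 1 ), "\n";    # prints: download grip, download drop
print run_lists( 0, 1 ), "\n";    # prints: upload, download grip, download drop
```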

In general, in all of ASSP, there are many commands that start with  
"do" or "no" or similar.  Leaving you with (!$noGriplist).

What does that mean?  If there is not not a Griplist?  In these cases,
I would almost lean on:
     if ($noGriplist == false) {
just to make it more clear to someone who wants to get up to speed.
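
One catch with the line above: as far as I know, the perl of this era has no "false" keyword, and under "use strict" a bare false does not compile. A sketch of two ways to lose the double negative instead; griplist_enabled is a hypothetical helper and the value of $noGriplist is assumed for illustration:

```perl
use strict;
use warnings;

# Hypothetical wrapper: turn the negative option into a positive name once.
sub griplist_enabled {
    my ($noGriplist) = @_;
    return $noGriplist ? 0 : 1;
}

my $noGriplist = 0;    # assumed value, for illustration only

# Either form reads without the double negative:
unless ($noGriplist) {
    print "griplist is in use\n";
}
if ( griplist_enabled($noGriplist) ) {
    print "griplist is in use\n";
}
```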

How about just moving away from using +/- in naming conventions?  Stuff
like this requires a lot of mental effort as you are moving through
the code:
     ( $DoNotCollectRed || $DoNotCollectRedList )
   -- or --
     (! $DoNotCollectRed || ! $DoNotCollectRedList )
  -- or --
     ( $DoNotCollectRed || ! $DoNotCollectRedList )

Or am I misunderstanding and for every negation of a variable, there  
is a positive counterpart, so for the "DoNotCollectRed" there is also  
a "DoCollectRed" ?

This leads me back to rebuildspamdb.pl: I am looking for suggested ways
to inject logging just for stdout testing.  I want to know which
message it is working on, and find what it is getting stuck on.  Once
the stuck message is fixed, I believe there is already some form of
timer in place; if the time spent on one message is more than a fixed
time, that message can be moved into a "suspect" directory.  With a
small modification, rebuildspamdb.pl could take an argument,
./rebuildspamdb.pl suspect/*, and take it from there, until that script
is bulletproof.

I don't want to fix just this one message, but make a system where
other messages do not hang the system and bad messages can time out.
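
A per-message timeout along those lines could be sketched with the usual alarm/eval idiom. This is an assumption about how it might be done, not how rebuildspamdb.pl does it; run_with_timeout, process_message and the suspect path are all placeholder names, and alarm limits this to Unix-like systems:

```perl
use strict;
use warnings;

# Run $code with a time limit. Returns 1 if it finished, 0 if it timed out.
# Any other error is rethrown. Relies on alarm(), so Unix-like systems only.
sub run_with_timeout {
    my ( $seconds, $code ) = @_;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $seconds;
        $code->();
        alarm 0;
        1;
    };
    alarm 0;    # make sure no alarm is left pending
    return 1 if $ok;
    return 0 if $@ eq "timeout\n";
    die $@;     # not a timeout: pass the error on
}

# Hypothetical usage: process_message() and the suspect directory are
# placeholders, not names taken from rebuildspamdb.pl.
# unless ( run_with_timeout( 30, sub { process_message($file) } ) ) {
#     rename $file, "suspect/$basename" or warn "move failed: $!";
# }
```

One caveat from the perlipc docs: perl only checks for signals between opcodes, so a single very long regex match may not be interrupted as promptly as the timeout suggests.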

One other thing I noticed in my report, and in others from other
servers, which are not always easy to get to:
Total processing time: 6681 second(s)

/usr/local/assp/spam
File Count: 27,752

/usr/local/assp/notspam
File Count: 29,727

Imported Files: 29,727
Finished in 798 second(s)

HELO Blacklist: 596 HELOs
What is the HELO blacklist doing?  Looking over a non-ASSP server, and a
few others as well, all in the same percentage range, that's really
low:

I do the bulk of my spam processing work in the ehlo/helo phase, with  
DNSBL coming in just before that for outright DNS blocking.  My logic  
is all about CPU. DNS lookup is pre ehlo/helo, at a connection level,  
so I can avoid almost all CPU hit by doing that test first. If it is a  
DNSWL (whitelist), I can then also pass the entire message with no  
more CPU penalty.

Next, I do my ehlo/helo checks.  I can not get the first hit date
easily, but I can say that the rule:
ehlo/helo does not contain "." has blocked 17 million emails, with maybe
2 false positives, but I simply told the postmaster how to fix it, and
they were good.

Next is ( and keep in mind, I have a weaker regex engine than ASSP on
this server ) *No OR, AND, or boolean support, fully greedy.

helo/ehlo matches pattern
[a-zA-Z_-]*[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}.*
then blacklist host - 864,000 hits.  This finds what looks like
dynamic address space. I block outright.  These are all the IPs that
should be in a DNSBL, but for some reason, are not.

The other great one is ehlo/helo starts with "[" or ends with "]" does  
not matter, 400,000 hits.  You can debate about this one, but it has  
not ever caused me any false positives.
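
For anyone wanting to try the three rules above in Perl, here is a rough sketch. helo_rule is a made-up name, the start-of-string anchor on the dashed pattern is my assumption about how my engine applies it, and the trailing .* is dropped because it changes nothing in a match test. The sample names use documentation addresses:

```perl
use strict;
use warnings;

# Returns the name of the first rule that matches, or '' if none does.
sub helo_rule {
    my ($helo) = @_;
    return 'no dot' if index( $helo, '.' ) < 0;
    return 'dashed ip'
        if $helo =~ /^[a-zA-Z_-]*[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}/;
    return 'bracket' if $helo =~ /^\[/ || $helo =~ /\]$/;
    return '';
}

for my $helo ( 'localhost', 'host-192-0-2-1.example.net',
    '[192.0.2.1]', 'mail.example.org' )
{
    printf "%-30s %s\n", $helo, helo_rule($helo) || 'pass';
}
```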

A few more:
helo/ehlo...
ends with .lan  - 53,000
is my ip subnet - 15,000
is yahoo.com, aol.com, google.com, microsoft.com, hostmail.com, etc -
about 15,000 each host; I have a ton of hosts.

There are a ton of good hosts to hit, stuff like these are a huge win,
and every ISP has a range of cable, DSL, or modems:
[0-9-]+\.subnet[0-9-]+\.[^.]+\.telkom\.net\.id

contains .dsl. .ppoe. .ppp. .cable. etc etc
These are a little more risky.  You can broaden those rules to include
more, but something like this is much safer:
a?dsl[-.]?dynamic[-.].*
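
The two patterns above, tried as Perl regexes. A sketch only: Perl matches anywhere in the string unless a pattern is anchored, which may differ from how my engine applies them, and looks_dynamic plus the sample host names are made up for illustration:

```perl
use strict;
use warnings;

# Patterns copied from this mail, unchanged and unanchored.
my @dynamic_patterns = (
    qr/[0-9-]+\.subnet[0-9-]+\.[^.]+\.telkom\.net\.id/,
    qr/a?dsl[-.]?dynamic[-.].*/,
);

# Returns 1 if any pattern matches the HELO string, else 0.
sub looks_dynamic {
    my ($helo) = @_;
    for my $re (@dynamic_patterns) {
        return 1 if $helo =~ $re;
    }
    return 0;
}

# Made-up sample names, for illustration only.
for my $helo ( '10-20.subnet30-40.example.telkom.net.id',
    'adsl-dynamic-1.example.net', 'mail.example.org' )
{
    printf "%-42s %s\n", $helo, looks_dynamic($helo) ? 'block' : 'pass';
}
```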

These are where you decide whether to block outright or pass them off
for weighted processing.  I just block outright and have never had an
issue; there are so many forged ehlo/helo patterns, and you can usually
explode those out into at least a /24.

A few other things: certain file attachments are for sure a no-no.
Also, if you have a customer who leaves, and should not get any email
again, I would not honeypot them, but you can refuse the connection
very high up in the chain, reducing CPU greatly.

The regexes are so simple that you can add even more obscure blocks:
ehlo/helo is chello.nl, since `dig mx chello.nl`

Then there are just moron hosts that have not stopped non-auth port 25
on their user hosts:
[0-9]+\.user\.veloxzone\.com\.br

I am just looking to have the time started on a message, time spent on
a message, headers, body, odd characters, and path to the message.
Running timed execution could help narrow down the file and see what is
going on.  If you can add regex patterns, even better.
-- 
Scott * If you contact me off list replace talklists@ with scott@ *

> Someone explain this to me like I am a 2 year old...
> Where *and how* do I download rebuildspamdb.pl for v2?
>
> Thanks,
> --  
> Scott * If you contact me off list replace talklists@ with scott@ *
>
>> Paul, next time it hangs can you look at the rebuildrun.txt while  
>> it's
>> hanging?  This is the tail of mine and oddly it didn't complete the
>> full
>> string before stopping..  Very odd.  Wish I knew which email it was
>> processing at the time.
>>
>> plain ~]# tail -f  /usr/local/assp/rebuildrun.txt
>> remove /usr/local/assp/spam/6973.eml Regex:Red 'bounce'
>> remove /usr/local/assp/spam/7013.eml Regex:Red 'Windows-1251'
>> remove /usr/local/assp/spam/7130.eml WhiteList:
>> '[email protected]'
>> remove /usr/local/assp/spam/7317.eml WhiteList:
>> '[email protected]'
>> remove /usr/local/assp/spam/7613.eml Regex:Red 'bounce'
>> remove /usr/local/assp/spam/7966.eml Regex:Red 'bounce'
>> remove /usr/local/assp/spam/80.eml WhiteList:
>> '[email protected]'
>> remove /usr/local/assp/spam/8073.eml WhiteList:
>> '[email protected]'
>> remove /usr/local/assp/spam/8354.eml Regex:Red 'Windows-1251'
>> remove /usr/lo
>
>
> ------------------------------------------------------------------------------
> This SF.Net email is sponsored by the Verizon Developer Community
> Take advantage of Verizon's best-in-class app development support
> A streamlined, 14 day to market process makes app distribution fast  
> and easy
> Join now and get one step closer to millions of Verizon customers
> http://p.sf.net/sfu/verizon-dev2dev
> _______________________________________________
> Assp-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/assp-user
