Perl amateur here, be gentle, trying to help... ( Sending to dev list
for hopefully more answers )
$./rebuildspamdb.pl
LWP::Simple 1.41 installed - download griplist available
******************************************************
RebuildSpamDB 2.6.1.1 (1.0.00) started - Thu Dec 24 12:19:13 2009
I was playing with the v1 rebuildspamdb.pl locally, and took a few
thousand of my IMAP mails from the server and copied them locally. I
then duplicated them to spam and notspam to have a set of around
50,000 messages each.
One thing I notice, the subroutine that does:
sub processfolder {
&printlog("\nProcessing...");
}
When there are a lot of files, it takes a long time for the line
"Processing..." to be printed, so there is a lot of work going on. I
can put my own debugging print statements above and below it, and they
are not printed until the subroutine is entirely done. Line number
suggestions?
The subroutine seems to recurse through all mails in each of a
few known directories, all of which it reports on, and it never once
failed for me. However, I want it to yield CPU time within the
routine in order to print debug data of my own as it is iterating.
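One possible cause of debug lines showing up only at the end is output
buffering, not the subroutine itself. A minimal sketch, assuming the
debug prints go to STDOUT or STDERR; debuglog() is a made-up helper name
and is not part of rebuildspamdb.pl:

```perl
use strict;
use warnings;
use IO::Handle;    # provides the autoflush method on filehandles

# Flush after every print instead of waiting for the buffer to fill.
STDOUT->autoflush(1);
STDERR->autoflush(1);

# debuglog() is a hypothetical helper: timestamped, printf-style debug line.
sub debuglog {
    my ($fmt, @args) = @_;
    my $line = sprintf("[%s] $fmt\n", scalar(localtime()), @args);
    print STDERR $line;
    return $line;
}

debuglog("entering %s", "processfolder");
```

Setting `$| = 1;` near the top of the script has the same effect for
STDOUT only.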
A few data metrics I desire:
Name of file being worked on, with path
Size of file being worked on, bytes
First line of file - looking for nulls and stray ff's
Last line of file - looking for nulls and stray ff's
Separate log: name of file \n contents of file, headers, then body if
possible
* basically, just general access to the file, but in real time, not
shoved out in one lump at the end, I want to see it as it is run.
Where would I work to add these in order to have these print lines put
in?
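To make the wish list above concrete, here is a rough standalone sketch
that walks a directory and prints path, size, first line and last line
for each file as it goes. file_report(), visible() and report_dir() are
names I made up; none of this is taken from rebuildspamdb.pl, and where
it would hook into processfolder is still the open question.

```perl
use strict;
use warnings;
use File::Find;

$| = 1;    # unbuffered STDOUT so each line appears as the file is visited

# file_report() is a hypothetical helper: path, size, first and last line.
sub file_report {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or return { path => $path, error => "$!" };
    my ($first, $last);
    while (my $line = <$fh>) {
        $first = $line unless defined $first;
        $last  = $line;
    }
    close $fh;
    return {
        path  => $path,
        size  => (-s $path) || 0,
        first => $first,
        last  => $last,
    };
}

# Show NUL, 0xFF and other non-printable bytes as \xNN so they stand out.
sub visible {
    my ($s) = @_;
    return '(empty)' unless defined $s;
    $s =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord($1))/ge;
    return $s;
}

sub report_dir {
    my ($dir) = @_;
    find({ no_chdir => 1, wanted => sub {
        return unless -f $_;
        my $r = file_report($_);
        printf "%s\n  size:  %d bytes\n  first: %s\n  last:  %s\n",
            $r->{path}, $r->{size}, visible($r->{first}), visible($r->{last});
    } }, $dir);
}

# Usage (the path is only an example):
# report_dir('/usr/local/assp/spam');
```

Reading every file to the end just to get its last line is slow on a
50,000 message corpus, but for a debugging run that may be acceptable.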
Here is the source as I was working on it, unaltered, so the line
numbers should be accurate.
A few other things I saw as I was reviewing...
Lines 167 - 175 have a hard line break in the code.
elsif ( $norm < 1.1 ) { $normdesc = '(very good -
balanced)'; }
heavy)'; }
&printlog( "Corpus norm:\t%.4f %s $normdesc \n", $norm );
Does this matter to perl in any significant way? ( trailing and open
close braces "}" ) I would get a parse error in any other language
that I am familiar with. Don't all these \n's and spaces end up in
the log, making log lines no longer single lines, but multi-line?
Would some type of cleanup solve this?
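As far as I can tell, Perl accepts a quoted string that spans source
lines, and the newline plus any indentation become part of the string.
A small sketch of that, with a made-up helper one_line() that squeezes
the whitespace back out. If the break is only my mail client wrapping
the line, none of this applies to the real file.

```perl
use strict;
use warnings;

# A single-quoted literal may span lines; newline and indentation are kept.
my $normdesc = '(very good -
    balanced)';

# one_line() is a hypothetical helper: collapse whitespace runs to one space.
sub one_line {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return $text;
}

print one_line($normdesc), "\n";    # prints: (very good - balanced)
```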
Lines 154 to 181, I can not tell which conditions begin and end
where. Are lines 167 - 175 part of an else, or do they stand alone?
I did some cleaning up: http://pastie.org/755914
Lines 7 - 11, is that an "else" just a little malformed, and perl is
OK with letting that multi-line bracing happen? To me, it looks like
lines 7 - 9 are just a single if with a single terminating brace. If
that is the case, then how does that hanging else not cause an error?
As I mentioned, total perl beginner here, but looking at if/else/
elseif docs, I can not see why a middle of condition terminator is
needed in an elseif, unless that is perl's "break;"
elsif ( $norm < 0.9 ) {
$normdesc = '(ok - slighly ham heavy)'; } <- What is this for?
I made a version of that chunk in php; perhaps this could be more
flexible than a big chunk of conditions, and allow more granular
control? I do not know if I will get to a full understanding of perl,
so this may not be of any value, other than that I believe an array of
$norm limits and their "values" is easier in the long run, as you never
edit a condition, just add and remove array name/value pairs. *php
floats are silly
http://pastie.org/756091
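The same array idea in Perl might look roughly like this. Only the two
limits quoted above (0.9 and 1.1) come from the script; the real chain
has more branches that I have not copied, and both norm_description()
and the fallback text are made up for the sketch.

```perl
use strict;
use warnings;

# Upper bound => description, in ascending order. Only these two pairs
# appear in the quoted code; the real script has more branches.
my @norm_table = (
    [ 0.9, '(ok - slightly ham heavy)' ],
    [ 1.1, '(very good - balanced)' ],
);

# norm_description() is a hypothetical replacement for the elsif chain.
sub norm_description {
    my ($norm) = @_;
    for my $row (@norm_table) {
        my ($upper, $desc) = @$row;
        return $desc if $norm < $upper;
    }
    return '(no description)';    # hypothetical fallback, not in the script
}

printf "Corpus norm:\t%.4f %s\n", 1.0, norm_description(1.0);
```

Adding or removing a band is then a one line change to @norm_table, as
long as the rows stay sorted by their upper bound.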
I then took the file and re-indented and cleaned the braces so I could
begin to understand it:
http://pastie.org/756094
It does appear that lines 21 - 36 in that pastie are one conditional
block.
Does perl not have a way to enforce some form of code consistency?
Original File, line 163 -164
}
else { #norm exists, print it
Altered file, same area:
} else { #norm exists, print it
And finally, original file, line 185 - 187
if ( time - $starttime != 0 ) { $processTime = time -
$starttime; }
else { $processTime = 1; }
&printlog( "\nTotal processing time: %d second(s)\n\n",
$processTime );
I see three different "cuddle" brace styles/assignments, and am not
sure if this is style, or methodologies in perl. Can someone who
knows more about perl elaborate with these cases?
Lines 190 - 193
if ( !$noGriplist || !$asspLog ) { &uploadgriplist(); }
&downloadgriplist();
&downloaddroplist();
Is that: if $noGriplist is false OR $asspLog is false, then toss
uploadgriplist() into the background and also run downloadgriplist()
and downloaddroplist() in the background, or do the last two griplist
commands happen regardless of the nearby condition?
if (condition) {
&uploadgriplist();
&downloadgriplist();
&downloaddroplist();
}
-- or --
if (condition) {
&uploadgriplist();
}
&downloadgriplist(); &downloaddroplist();
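To check this for myself I would try a stand-in like the one below. The
three subs are stubs that only record that they were called (the real
ones live in the script), and run() is a made-up wrapper. As far as I
understand Perl, the if block ends at its closing brace no matter how
the line is laid out, and plain sub calls run one after another, not in
the background.

```perl
use strict;
use warnings;

my @calls;
# Stand-ins for the real subs, so the order of calls can be observed.
sub uploadgriplist   { push @calls, 'upload' }
sub downloadgriplist { push @calls, 'download grip' }
sub downloaddroplist { push @calls, 'download drop' }

# run() is a hypothetical wrapper around the three lines from the script.
sub run {
    my ($noGriplist, $asspLog) = @_;
    @calls = ();
    if ( !$noGriplist || !$asspLog ) { uploadgriplist(); }    # block ends here
    downloadgriplist();    # outside the block: always runs
    downloaddroplist();    # outside the block: always runs
    return "@calls";
}

print run(1, 1), "\n";    # prints: download grip download drop
print run(0, 1), "\n";    # prints: upload download grip download drop
```

So the original lines behave like the second of my two variants.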
In general, in all of ASSP, there are many commands that start with
"do" or "no" or similar. Leaving you with (!$noGriplist).
What does that mean? If there is not not a Griplist? In this case,
I would almost lean on:
if ($noGriplist == false) {
just to make it more clear to someone who wants to get up to speed.
How about just moving away from using +/- in naming conventions? Stuff
like this just requires a lot of mental effort as you are moving
through the code:
( $DoNotCollectRed || $DoNotCollectRedList )
-- or --
(! $DoNotCollectRed || ! $DoNotCollectRedList )
-- or --
( $DoNotCollectRed || ! $DoNotCollectRedList )
Or am I misunderstanding and for every negation of a variable, there
is a positive counterpart, so for the "DoNotCollectRed" there is also
a "DoCollectRed" ?
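For comparison, a sketch of a few ways to spell "the griplist is
enabled" in Perl. As far as I know there is no bareword `false` in the
Perl versions ASSP targets, and under `use strict` writing
`$noGriplist == false` would not compile, so the `== false` form would
need something else. $useGriplist is a made-up positive name; I do not
know whether ASSP has such counterparts.

```perl
use strict;
use warnings;

# $noGriplist mirrors the flag name in the script; the value is an example.
my $noGriplist = 0;

# Two equivalent ways to test "the griplist is enabled":
my $via_not = !$noGriplist ? 1 : 0;

my $via_unless = 0;
unless ($noGriplist) { $via_unless = 1 }

# $useGriplist is a hypothetical positive name, derived once near the config.
my $useGriplist = $noGriplist ? 0 : 1;

print "griplist enabled\n" if $useGriplist;
```

Deriving the positive name once keeps the config option as it is while
the conditions further down read without a double negative.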
This leads me to rebuildspamdb.pl: I am looking for suggested ways
to inject logging just for stdout testing. I want to know which
message it is working on, and find what it is getting stuck on. Once
the stuck message is fixed, I believe there is already some form of a
timer in place; if the time spent on one message is more than a fixed
time, that message can be moved into a "suspect" directory. With a
small modification, rebuildspamdb.pl could take an argument,
./rebuildspamdb.pl suspect/*, and take it from there, until that
script is bulletproof.
I don't want to fix just this one message, but make a system where
other messages do not hang the system, and bad messages can time out.
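A minimal sketch of the per-message timeout idea, using the usual
eval/alarm idiom. with_timeout() is a made-up helper, not something in
rebuildspamdb.pl. Two caveats as far as I understand them: alarm is
unreliable on Windows, and because Perl defers signals, a single regex
that runs for a very long time may not be interrupted until it returns.

```perl
use strict;
use warnings;

# with_timeout() is a hypothetical helper: run $code, give up after $secs
# seconds. Returns 1 on success, 0 on timeout; other errors are rethrown.
sub with_timeout {
    my ($secs, $code) = @_;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $secs;
        $code->();
        alarm 0;
        1;
    };
    alarm 0;                          # make sure no alarm is left pending
    return 1 if $ok;
    return 0 if $@ eq "timeout\n";
    die $@;                           # not a timeout: pass the error on
}

# Usage sketch; process_message() and the suspect/ directory are assumptions:
# use File::Basename;
# for my $file (@files) {
#     print "working on $file\n";
#     next if with_timeout(30, sub { process_message($file) });
#     rename $file, "suspect/" . basename($file);
# }
```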
One other thing I noticed in my report, and in others from other
servers ( not always easy to get to ):
Total processing time: 6681 second(s)
/usr/local/assp/spam
File Count: 27,752
/usr/local/assp/notspam
File Count: 29,727
Imported Files: 29,727
Finished in 798 second(s)
HELO Blacklist: 596 HELOs
What is the HELO blacklist doing? Looking over a non-ASSP server, and a
few others as well, all in the same percentage range, that's really
low:
I do the bulk of my spam processing work in the ehlo/helo phase, with
DNSBL coming in just before that for outright DNS blocking. My logic
is all about CPU. DNS lookup is pre ehlo/helo, at a connection level,
so I can avoid almost all CPU hit by doing that test first. If it is a
DNSWL (whitelist), I can then also pass the entire message with no
more CPU penalty.
Next, I do my ehlo/helo checks. I can not get the first hit date
easily, but I can say that rule:
ehlo/helo does not contain "." has blocked 17 million emails, with
maybe 2 false positives, but I simply told the postmaster how to fix
it, and they were good.
Next is ( and keep in mind, I have a weaker regex engine than ASSP on
this server ) *No OR, AND, or boolean support, fully greedy.
helo/ehlo matches pattern
[a-zA-Z_-]*[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}.*
then blacklist host - 864,000 hits. This finds what looks like
dynamic address space. I block outright. These are all the IPs that
should be in a DNSBL, but for some reason are not.
The other great one is ehlo/helo starts with "[" or ends with "]" does
not matter, 400,000 hits. You can debate about this one, but it has
not ever caused me any false positives.
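The three rules so far, written as Perl regexes for anyone who wants to
try them against their own HELO strings. The rule labels and
helo_block_reason() are made up, and anchoring the dashed-IP pattern at
the start of the string is my assumption about how "matches" works on
my server.

```perl
use strict;
use warnings;

# Rules from the text, checked in order. Labels are for this sketch only.
my @helo_rules = (
    [ 'no dot'    => qr/^[^.]*$/ ],
    [ 'dashed IP' => qr/^[a-zA-Z_-]*[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}-[0-9]{1,3}.*/ ],
    [ 'bracketed' => qr/^\[|\]$/ ],
);

# Returns the label of the first matching rule, or undef if none match.
sub helo_block_reason {
    my ($helo) = @_;
    for my $rule (@helo_rules) {
        my ($name, $re) = @$rule;
        return $name if $helo =~ $re;
    }
    return;
}

# The host names below are invented examples.
for my $helo ('localhost', 'host-1-2-3-4.example.net', '[192.0.2.1]', 'mail.example.com') {
    my $reason = helo_block_reason($helo);
    printf "%-28s %s\n", $helo, defined $reason ? "block ($reason)" : 'pass';
}
```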
A few more:
helo/ehlo...
ends with .lan - 53,000
is my ip subnet - 15,000
is yahoo.com, aol.com, google.com, microsoft.com, hostmail.com, etc -
about 15,000 each host, I have a ton of hosts.
There are a ton of good hosts to hit, stuff like these are a huge win,
and every ISP has a range of cable, DSL, or modems -
[0-9-]+\.subnet[0-9-]+\.[^.]+\.telkom\.net\.id
contains .dsl. .ppoe. .ppp. .cable. etc etc
Little more risky, you can broaden those rules to include more,
something like this is much safer:
a?dsl[-.]?dynamic[-.].*
These are where you will decide to block outright, or pass them off
for weighted processing. I just outright block, never had an issue,
there are so many forged ehlo/helo patterns, and you can usually
explode those out into a /24 at least better.
A few other things, certain file attachments are for sure a no no.
Also, if you have a customer who leaves, and should not get any email
again, I would not honeypot them, but you can Refuse the Connection
very high up in the chain, reducing CPU greatly.
The regexes are so simple that even more obscure blocks are possible:
ehlo/helo is chello.nl, since `dig mx chello.nl`
Then there are just moron hosts that have not stopped port 25 non auth
on their
[0-9]+\.user\.veloxzone\.com\.br
I am just looking to have time started on a message, time spent on a
message, headers, body, odd characters, and path to the message.
Running timed execution could help narrow down the file and see
what is going on. If you can add regex patterns, even better.
--
Scott * If you contact me off list replace talklists@ with scott@ *
> Someone explain this to me like I am a 2 year old...
> Where *and how* do I download rebuildspamdb.pl for v2?
>
> Thanks,
> --
> Scott * If you contact me off list replace talklists@ with scott@ *
>
>> Paul, next time it hangs can you look at the rebuildrun.txt while
>> it's
>> hanging? This is the tail of mine and oddly it didn't complete the
>> full
>> string before stopping.. Very odd. Wish I knew which email it was
>> processing at the time.
>>
>> plain ~]# tail -f /usr/local/assp/rebuildrun.txt
>> remove /usr/local/assp/spam/6973.eml Regex:Red 'bounce'
>> remove /usr/local/assp/spam/7013.eml Regex:Red 'Windows-1251'
>> remove /usr/local/assp/spam/7130.eml WhiteList:
>> '[email protected]'
>> remove /usr/local/assp/spam/7317.eml WhiteList:
>> '[email protected]'
>> remove /usr/local/assp/spam/7613.eml Regex:Red 'bounce'
>> remove /usr/local/assp/spam/7966.eml Regex:Red 'bounce'
>> remove /usr/local/assp/spam/80.eml WhiteList:
>> '[email protected]'
>> remove /usr/local/assp/spam/8073.eml WhiteList:
>> '[email protected]'
>> remove /usr/local/assp/spam/8354.eml Regex:Red 'Windows-1251'
>> remove /usr/lo
>
>
> ------------------------------------------------------------------------------
> This SF.Net email is sponsored by the Verizon Developer Community
> Take advantage of Verizon's best-in-class app development support
> A streamlined, 14 day to market process makes app distribution fast
> and easy
> Join now and get one step closer to millions of Verizon customers
> http://p.sf.net/sfu/verizon-dev2dev
> _______________________________________________
> Assp-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/assp-user