I had no idea that your installation was on Windows. I always assumed, for all these years, that you were 100% Linux!
I've got a corpus size of 30k for each folder, so something like 65k files
across spam, notspam, and the periodically hand-trimmed errors/notspam and
errors/spam folders. So we're about even on per-message speed. I bet most
of my slowness is disk access: I'm running in a VM on a 4-disk RAID 10
array of 7200 rpm SAS disks. I've always wanted to move to SSD, but the
charity's got no budget for that :(

Thanks for the info on the rebuild thread. The logic is simple, but it
consists of complicated, resource-intensive tasks. I'm very curious to see
how much my rebuild time improves. I hope to try this weekend or next.

I've never run a Windows process that uses more than 3, maybe 4 GB of RAM,
and that's usually when something is going wrong. I'm a little worried
about having the first 10k of 65k messages in memory all of the time, but
we'll see!!

THANK YOU

On Tue, Nov 10, 2020 at 2:38 AM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:

> My system processes ~21,000 files per rebuild and this takes 2min 23
> seconds, using the FileModel (~13min without).
> Generating both DBs and storing them into MySQL takes ~11min.
>
> Nov-10-20 04:00:12 RebuildSpamDB-thread rebuildspamdb-version 8.03 started in ASSP version 2.6.4(20310)
> Nov-10-20 04:00:12 Processing... errors/spam with 2,814 files
> Nov-10-20 04:00:25 Processing... errors/notspam with 1,126 files
> Nov-10-20 04:00:31 Processing... spam with 9,355 files
> Nov-10-20 04:01:24 Processing... notspam with 7,712 files
> Nov-10-20 04:02:23 Generating weighted Bayesian tuplets
> Nov-10-20 04:02:39 start populating Spamdb with 2,163,714 records!
> Nov-10-20 04:05:11 Bayesian Pairs: 2,163,714 now in list
> Nov-10-20 04:05:15 Generating consolidated Hidden-Markov-Model database from 9,786,151 record model
> Nov-10-20 04:06:15 HMM sequences: 4,789,916 now in list
> Nov-10-20 04:06:16 generating Spamdb.helo records from 8,629 collected HELO's
> Nov-10-20 04:06:17 start populating Hidden Markov Model with 4,789,916 records!
> Nov-10-20 04:13:14 Finished populating Hidden Markov Model. HMM-check is now enabled again!
> Nov-10-20 04:13:14 Total processing time: 782 second(s)
> Nov-10-20 04:13:14 Total processing data: 846.49 MByte
> Nov-10-20 04:13:14 Rebuild processed 160.34 files per second.
> Nov-10-20 04:13:17 Trashlist was saved to c:/assp/trashlist.db
>
> The system (on Windows 2016) holds as much as possible in RAM; this takes
> 6.5 GB max and 3 GB avg. while running normally, and 1 GB after startup.
>
> assp-process-memory: current: 3064 MB min: 960 MB max 3910 MB
> max is 6,534 MB while the rebuild runs
>
> > emails (over 20mb) ... Does that impact in-memory usage
>
> Not really.
> As always, only 'MaxBytes' of the body of each file are processed, and
> attachments are hashed.
>
> >I'm always worried about stability,
>
> There is no big difference in stability between Windows and *nix for assp
> on relatively small installations. To keep everything running well, my
> assp restarts once a week.
>
> >having the emails in memory
>
> Only the rebuild thread (10001) holds 'something' in memory (or
> BerkeleyDB) while the rebuild runs - but these are not files, this is the
> FileModel, which is something like a memory image for each file, after it
> was processed by the rebuild.
> Processing a file is: analyse the header, collect helos, decode MIME,
> convert to UTF-8, remove disclaimer, hash attachments, collect words,
> unicode normalize words, stem words .... - which is very resource and time
> consuming.
> Instead of processing every file, the rebuild uses the FileModel content
> of the file (if available).
>
> Thomas
>
>
> From: "K Post" <nntp.p...@gmail.com>
> To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date: 09 Nov 2020 20:40
> Subject: Re: [Assp-test] fixes in assp 2.6.4 *SPAM-Evaporator* build 20310
> ------------------------------
>
> Thanks for the additional information.
> The in-memory bit keeps the entire corpus in memory, not just the current
> day's new mail! That makes more sense. So after a restart of the service,
> we should expect a slower (same speed as previous versions) rebuild, right?
>
> Other than the speed of the rebuild (which takes about 45 minutes on my
> 30,000-file installation, fyi), is there any other benefit to the
> installation from having the emails in memory?
>
> We do get some pretty big emails (over 20mb) many times a day, and I store
> the whole thing so we can resend if necessary. Does that impact in-memory
> usage, or does just the first x kb get stored - what's needed for the
> rebuild?
>
> My memory usage is usually around 1500mb, but it grows over time,
> sometimes to 3gb or more. Do you see this same growth? Mind you, I'm on
> Windows, not *nix. I'm always worried about stability, but it would be
> nice to have a super fast rebuild....
>
> On Mon, Nov 9, 2020 at 11:17 AM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:
> If the required information for an eml-file is not found in the FileModel,
> the file is processed normally and the info for this file is added to the
> FileModel.
> Information for files that no longer exist is removed from the FileModel.
>
> >2GB:
> this depends on the file count and MaxBytes (and some others) - it can be
> much more or much less. I expect 2GB for 30,000 files - a wild guess.
> My system requires ~1.2GB RAM for 20,000 files.
>
> >Does the new rebuildspamdb.pm improve rebuild time without enabling
> >RebuildUsesFileModel?
>
> No.
>
> Thomas
>
>
> From: "K Post" <nntp.p...@gmail.com>
> To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
> Date: 09 Nov 2020 16:59
> Subject: Re: [Assp-test] fixes in assp 2.6.4 *SPAM-Evaporator* build 20310
> ------------------------------
>
> Exciting changes. Thank you.
> On RebuildUsesFileModel, if ASSP crashes or the system is restarted, the
> RAM storage is obviously lost. Does the new process revert to the old
> method so that all messages from that day are considered during the rebuild?
> You gave 2gb as an example of additional memory for this new feature.
> What kind of email volume is that based on?
> Does the new rebuildspamdb.pm improve rebuild time without enabling
> RebuildUsesFileModel?
>
> Thanks again
> Ken
>
>
> On Thu, Nov 5, 2020 at 4:48 AM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:
> Hi all,
>
> fixed in assp 2.6.4 *SPAM-Evaporator* build 20310:
>
> - trailing digits in the hostname (like 'mx.microsoft.com 1') in
> ARC-header lines led to a 'notmatch' for trusted forwarder definitions
>
>
> changed:
>
> - The rebuildspamdb.pm module is upgraded to version 8.03. It provides
> faster rebuild processing, and much shorter locking times for HMMdb and
> SpamDB.
>
> - performance improvement for the import/export database feature
>
> - if email addresses and IP-addresses are managed using the GUI, a given
> reason and the date are written to the comment of the modified line
>
> - improved MIME-header fixup for missing boundary definitions
>
> - improved database cache handling
>
>
> added:
>
> 'RebuildUsesFileModel','Build a Model from all processed emails for faster
> processing'
>
> The rebuild task builds a content model (in memory or BerkeleyDB only) of
> all processed files, and uses this model at the next rebuild for faster
> processing.
> The time to process the mail-files is reduced down to a tenth (if
> BerkeleyDB is not used ( useDB4Rebuild OFF )), but requires a large amount
> of additional memory - e.g. 2GB.
> The time to process the mail-files is reduced to a half, if BerkeleyDB is
> used ( useDB4Rebuild ON ).
> The default setting is ON
> The first rebuild after setting this to ON will run at normal speed -
> all subsequent rebuilds will run faster.
>
>
> Thomas
>
> DISCLAIMER:
> *******************************************************
> This email and any files transmitted with it may be confidential, legally
> privileged and protected in law and are intended solely for the use of the
> individual to whom it is addressed.
> This email was scanned for viruses multiple times. There should be no
> known virus in this email!
> *******************************************************
>
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
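
P.S. To check my own understanding of the FileModel behaviour described in the quoted messages (use the stored model if a file has one, otherwise process the file normally and add it, and drop entries for files that no longer exist), here is a minimal sketch in Python. It is not ASSP's code (ASSP is Perl), and `rebuild`, `expensive_process`, and keying the model by file name are all my own assumptions for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def expensive_process(raw: str) -> List[str]:
    """Hypothetical stand-in for the costly per-file work
    (header analysis, MIME decoding, word collection, stemming, ...)."""
    return sorted(set(raw.lower().split()))


def rebuild(
    corpus: Dict[str, str],
    file_model: Dict[str, List[str]],
    process: Callable[[str], List[str]] = expensive_process,
) -> Tuple[int, int, int]:
    """Run one rebuild pass over `corpus` (file name -> raw text).

    `file_model` is updated in place. Returns (hits, misses, removed).
    """
    hits = misses = 0
    for name, raw in corpus.items():
        if name in file_model:
            # Model entry exists: reuse it and skip the expensive work.
            hits += 1
        else:
            # No entry yet: process normally and remember the result.
            file_model[name] = process(raw)
            misses += 1

    # Forget entries for files that are no longer in the corpus.
    stale = [name for name in file_model if name not in corpus]
    for name in stale:
        del file_model[name]

    return hits, misses, len(stale)


if __name__ == "__main__":
    model: Dict[str, List[str]] = {}
    corpus = {"spam/1.eml": "Buy cheap pills", "notspam/2.eml": "Meeting at noon"}
    print(rebuild(corpus, model))  # first pass: every file is processed

    del corpus["spam/1.eml"]
    corpus["spam/3.eml"] = "Win a prize"
    print(rebuild(corpus, model))  # later pass: one reuse, one new, one removed
```

If the model lives only in RAM, a restart empties it, so the next pass behaves like the first one here: every file is a miss and gets processed at normal speed.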
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test