I had no idea that your installation was on Windows.  I always assumed, for
all these years, that you were 100% Linux!

I've got a corpus size of 30k for each folder, so something like 65k files
once you add up spam, notspam, and the periodically hand-trimmed errors/notspam
and errors/spam folders.  So we're about even on per-message speed.  I bet most
of my slowness is disk access: I'm running in a VM on a 4-disk RAID 10 array of
7200rpm SAS disks.  I've always wanted to move to SSD, but the charity's
got no budget for that :(

Thanks for the info on the rebuild thread.  The logic is simple, but
consists of complicated and intensive tasks.  I'm very curious to see how
my rebuild time is improved.  I hope to try this weekend or next.

I've never run a Windows process that uses more than 3, maybe 4 GB of RAM,
and that's usually when there's something going wrong.  I'm a little
worried about having the first 10k of 65k messages in memory all of the
time, but we'll see!!
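
As a rough sanity check on that worry, here is a back-of-the-envelope sketch. It assumes FileModel memory grows linearly with file count, anchored on the ~1.2 GB for 20,000 files quoted further down. That linearity is only an assumption: the quoted reply says the real figure depends on MaxBytes and other settings and can be much more or much less. The function name is made up for illustration.

```python
def estimate_filemodel_gb(file_count, ref_files=20_000, ref_gb=1.2):
    """Linear extrapolation from one reference point (an assumption, not ASSP logic)."""
    return ref_gb * file_count / ref_files


# 20k is the reference point, 30k is the "wild guess" case, 65k is this corpus.
for n in (20_000, 30_000, 65_000):
    print(f"{n:>6} files -> ~{estimate_filemodel_gb(n):.1f} GB")
```

On that assumption a 65k-file corpus lands near 4 GB of additional memory, so it is worth watching the process during the first couple of rebuilds.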

THANK YOU


On Tue, Nov 10, 2020 at 2:38 AM Thomas Eckardt <thomas.ecka...@thockar.com>
wrote:

> My system processes ~21,000 files per rebuild and this takes 2min 23
> seconds, using the FileModel (~13min without).
> Generating both DBs and storing them in MySQL takes ~11min.
>
> Nov-10-20 04:00:12 RebuildSpamDB-thread rebuildspamdb-version 8.03 started
> in ASSP version 2.6.4(20310)
> Nov-10-20 04:00:12 Processing... errors/spam with 2,814 files
> Nov-10-20 04:00:25 Processing... errors/notspam with 1,126 files
> Nov-10-20 04:00:31 Processing... spam with 9,355 files
> Nov-10-20 04:01:24 Processing... notspam with 7,712 files
> Nov-10-20 04:02:23 Generating weighted Bayesian tuplets
> Nov-10-20 04:02:39 start populating Spamdb with 2,163,714 records!
> Nov-10-20 04:05:11 Bayesian Pairs: 2,163,714 now in list
> Nov-10-20 04:05:15 Generating consolidated Hidden-Markov-Model database
> from 9,786,151 record model
> Nov-10-20 04:06:15 HMM sequences: 4,789,916 now in list
> Nov-10-20 04:06:16 generating Spamdb.helo records from 8,629 collected
> HELO's
> Nov-10-20 04:06:17 start populating Hidden Markov Model with 4,789,916
> records!
> Nov-10-20 04:13:14 Finished populating Hidden Markov Model. HMM-check is
> now enabled again!
> Nov-10-20 04:13:14 Total processing time: 782 second(s)
> Nov-10-20 04:13:14 Total processing data: 846.49 MByte
> Nov-10-20 04:13:14 Rebuild processed 160.34 files per second.
> Nov-10-20 04:13:17 Trashlist was saved to c:/assp/trashlist.db
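
The figures in the log above can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the log. The one assumption, inferred from the timestamps and not stated anywhere in the log, is that the reported files-per-second rate covers only the file-processing phase (first "Processing..." line up to "Generating weighted Bayesian tuplets") and not the full 782 seconds.

```python
from datetime import datetime

# File counts copied from the "Processing..." lines of the log.
folders = {
    "errors/spam": 2814,
    "errors/notspam": 1126,
    "spam": 9355,
    "notspam": 7712,
}
total_files = sum(folders.values())


def seconds_between(start, end):
    """Elapsed seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Assumption: file processing runs from 04:00:12 to 04:02:23.
processing_seconds = seconds_between("04:00:12", "04:02:23")
total_seconds = 782  # "Total processing time" from the log

rate_processing = total_files / processing_seconds
rate_overall = total_files / total_seconds

print(f"files processed:       {total_files}")
print(f"file-processing phase: {processing_seconds:.0f} s -> {rate_processing:.2f} files/s")
print(f"whole rebuild:         {total_seconds} s -> {rate_overall:.2f} files/s")
```

Under that assumption the processing phase works out to about 160.36 files/s, close to the logged 160.34. Over the whole 782-second run, which includes populating both databases, the rate is about 26.86 files/s.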
>
> The system (on Windows Server 2016) holds as much as possible in RAM. This
> takes 6.5 GB max and 3 GB avg. while running normally, and 1 GB after startup.
>
> assp-process-memory:        current: 3064 MB        min: 960 MB        max
> 3910 MB
> max is 6,534 MB while the rebuild runs
>
> > emails (over 20mb)  ... Does that impact in-memory usage
>
> Not really.
> As has always been the case, only the first 'MaxBytes' of the body of each
> file are processed, and attachments are hashed.
>
> >I'm always worried about stability,
>
> There is no big difference in stability between Windows and *nix for ASSP
> on relatively small installations. To keep everything running well, my ASSP
> restarts once a week.
>
> >having the emails in memory
>
> Only the rebuild thread (10001) holds 'something' in memory (or
> BerkeleyDB) while the rebuild runs - but these are not files, this is the
> FileModel, which is something like a memory image for each file, after it
> was processed by the rebuild.
> Processing a file means: analyse the header, collect HELOs, decode MIME,
> convert to UTF-8, remove disclaimers, hash attachments, collect words,
> Unicode-normalize words, stem words .... - which is very resource- and
> time-consuming.
> Instead of processing every file, the rebuild uses the FileModel content
> of the file (if available).
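
The behaviour described above is essentially a per-file cache of processing results. A minimal sketch of the idea follows, in Python and not ASSP's actual Perl code. The class and function names are hypothetical, and keying entries by file path is an assumption. The miss handling and the removal of entries for deleted files follow the earlier reply quoted further down in this thread.

```python
class FileModelCache:
    """Hypothetical stand-in for the FileModel: the result of the expensive
    per-file processing step, kept between rebuilds and keyed by file path."""

    def __init__(self, process_file):
        self._process_file = process_file  # the expensive step (decode, stem, ...)
        self._model = {}                   # path -> processed representation
        self.hits = 0
        self.misses = 0

    def get(self, path):
        """Return the stored model for path, processing the file on a miss."""
        if path in self._model:
            self.hits += 1
        else:
            self.misses += 1
            self._model[path] = self._process_file(path)
        return self._model[path]

    def prune(self, existing_paths):
        """Drop entries for files that are no longer in the corpus."""
        existing = set(existing_paths)
        for path in list(self._model):
            if path not in existing:
                del self._model[path]

    def __len__(self):
        return len(self._model)


def rebuild(cache, corpus_paths):
    """One rebuild pass: prune stale entries, then fetch a model per file."""
    cache.prune(corpus_paths)
    return [cache.get(p) for p in corpus_paths]


# Toy "processing" for the demo: derive tokens from the file name.
def fake_process(path):
    return path.lower().split("/")


cache = FileModelCache(fake_process)
rebuild(cache, ["spam/a.eml", "spam/b.eml", "notspam/c.eml"])     # first run: all misses
rebuild(cache, ["spam/b.eml", "notspam/c.eml", "notspam/d.eml"])  # a.eml gone, d.eml new
print(cache.hits, cache.misses, len(cache))
```

The first pass pays the full processing cost for every file. Later passes only pay it for files that are new since the previous rebuild, which is why the first rebuild after enabling the option (or after a restart, when the in-memory model is gone) runs at the old speed.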
>
> Thomas
>
>
>
>
>
>
> From:        "K Post" <nntp.p...@gmail.com>
> To:        "ASSP development mailing list" <
> assp-test@lists.sourceforge.net>
> Date:        09 Nov 2020 20:40
> Subject:        Re: [Assp-test] fixes in assp 2.6.4 *SPAM-Evaporator*
> build 20310
> ------------------------------
>
>
>
> Thanks for the additional information.  The in-memory bit keeps the entire
> corpus in memory, not just the current day's new mail!  That makes more
> sense.  So after a restart of the service, we should expect a slower (same
> speed as previous versions) rebuild right?
>
> Other than the speed of the rebuild (which takes about 45 minutes on my
> 30k installation, FYI), is there any other benefit to the installation
> from having the emails in memory?
>
> We do get some pretty big emails (over 20mb) many times a day, and I store
> the whole thing so we can resend if necessary.  Does that impact in-memory
> usage or does just the first x kb get stored - what's needed for the
> rebuild?
>
> My memory usage is usually around 1500mb, but it grows over time,
> sometimes to 3gb or more.  Do you see this same growth?  Mind you, I'm on
> Windows, not *nix.  I'm always worried about stability, but it would be
> nice to have a super fast rebuild....
>
> On Mon, Nov 9, 2020 at 11:17 AM Thomas Eckardt
> <thomas.ecka...@thockar.com> wrote:
> If the required information for an eml file is not found in the FileModel,
> the file is processed normally and the info for this file is added to the
> FileModel.
> Information for files that no longer exist is removed from the FileModel.
>
> >2GB:
> This depends on the file count and MaxBytes (and some others) - it can be
> much more or much less. I expect 2GB for 30,000 files - a wild guess.
> My system requires ~ 1.2GB RAM for 20,000 files.
>
> >Does the new rebuildspam.pm improve rebuild
> time without enabling RebuildUsesFileModel?
>
> No.
>
> Thomas
>
>
>
>
>
>
> From:        "K Post" <nntp.p...@gmail.com>
> To:        "ASSP development mailing list" <
> assp-test@lists.sourceforge.net>
> Date:        09 Nov 2020 16:59
> Subject:        Re: [Assp-test] fixes in assp 2.6.4 *SPAM-Evaporator*
> build 20310
> ------------------------------
>
>
>
> Exciting changes. Thank you.
> On RebuildUsesFileModel, if ASSP crashes or the system is restarted, the
> RAM storage is obviously lost.  Does the new process revert to the old
> method so that all messages from that day are considered during the rebuild?
> You gave 2gb as an example of additional memory for this new feature.
> What kind of email volume is that based on?
> Does the new rebuildspam.pm improve rebuild
> time without enabling RebuildUsesFileModel?
>
> Thanks again
> Ken
>
>
>
> On Thu, Nov 5, 2020 at 4:48 AM Thomas Eckardt
> <thomas.ecka...@thockar.com> wrote:
> Hi all,
>
> fixed in assp 2.6.4 *SPAM-Evaporator* build 20310:
>
> - trailing digits in the hostname (like 'mx.microsoft.com 1') in
> ARC-header lines led to a 'notmatch' for trusted forwarder definitions
>
>
>
> changed:
>
> - The rebuildspamdb.pm module is upgraded to
> version 8.03. It provides faster rebuild processing, and much shorter
> locking times for HMMdb and SpamDB.
>
> - performance improvement for the import/export database feature
>
> - if email addresses and IP-addresses are managed using the GUI, a given
> reason and the date are written to the comment of the modified line
>
> - improved MIME-header fixup for missing boundary definitions
>
> - improved database cache handling
>
>
> added:
>
> 'RebuildUsesFileModel','Build a Model from all processed emails for faster
> processing'
>
>  The rebuild task builds a content model (in memory or BerkeleyDB only) of
> all processed files, and uses this model at the next rebuild for faster
> processing.
>  The time to process the mail-files is reduced to a tenth (if
> BerkeleyDB is not used ( useDB4Rebuild OFF )), but this requires a large
> amount of additional memory - e.g. 2GB.
>  The time to process the mail-files is reduced to a half, if BerkeleyDB is
> used ( useDB4Rebuild ON ).
>  The default setting is ON
>  The first rebuild after setting this to ON will run at normal speed -
> all subsequent rebuild tasks will run faster.
>
>
>
> Thomas
>
> DISCLAIMER:
> *******************************************************
> This email and any files transmitted with it may be confidential, legally
> privileged and protected in law and are intended solely for the use of the
> individual to whom it is addressed.
> This email was multiple times scanned for viruses. There should be no
> known virus in this email!
> *******************************************************
>
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test