Hello, Kjartan. You wrote in <mid:[EMAIL PROTECTED]>
KÁ> That is the same method as used in the new Mozilla, right?

I don't know which method Mozilla uses, sorry...

KÁ> You say you are about to finish this project. When can we expect the
KÁ> first version to be born?

The first version will be born very, very soon. I have finished the base-generating machine, and now the simple remaining task is just to build the filter itself from the modules already written. For the moment I have finished a utility that can open and "parse" mailbases, build and save frequency dictionaries, and build a "regarding base", exactly the one that will be used in the filter itself.

This utility can be downloaded at http://klirik.narod.ru/arc/baesyan.exe (file size 294912 bytes). This is not an installer; it is the program file itself.

Let me tell you about some features of this machine.

The method I chose is Paul Graham's approach ("A Plan for Spam"), partially mixed with his "Better Bayesian Filtering". I take the whole raw letter, including the RFC headers, as Paul Graham does, and make a frequency dictionary from it. The tokenizing approach depends on whether a part is HTML or plain text. If it is plain text, I use a simple definition of a token. If it is HTML, I also scan encoded URLs and bogus HTML comments.

I distinguish between header and body tokens, like Paul Graham in "Better Bayesian Filtering", but I do not classify a token as belonging to "to", "from", "subject" or other parts, only as "body" or "headers". All tokens from headers go into the frequency dictionary with the prefix "_h ", for example "_h Enlarge". All bogus HTML comments go into the frequency dictionary as the token "_s spam". Each attachment is treated as one token, for example "_F jpeg<SZ00004f50:CC00138de6>", where the prefix "_F " means "file", "jpeg" is the content subtype, and the angle brackets hold the size and CRC, prefixed by SZ and CC. Base64 and Quoted-Printable parts are decoded before processing.

I also implemented some locale features, because I am Russian and this matters for me and for the mail I receive: I read from The Bat!
registry the XLT tables and apply them to the decoded text when necessary (for Russian this happens often, because there are at least two popular encodings, win-1251 and koi-8r; I handle these cases correctly). Also, spammers sometimes replace some national letters in words with look-alike letters from English (like p, e, a, c and so on). I handle this case too, and it is very simple to handle it for other languages as well.

More specific features:

This machine works directly with the mailbases of The Bat! (*.tbb files). I chose this format because it is better than plain *.eml when you work with a big corpus of letters. If you keep your attachments outside the base, it is no problem: the machine will find the necessary files on your disk and index them.

In this machine, scanning for bogus HTML comments is limited to comments that consist entirely of digits. This is not a limitation, just an option. Scanning for any comment without spaces, and for any comment at all, is also implemented but not switched on for the moment (this may not be good for embedded scripts).

So, what is this machine for and how do you use it?

With this utility you can (and in the near future will) make the regard base that the filter itself needs in order to work. I will implement the filter very soon, maybe even tomorrow, so you can already make a base for it.

First, create two folders in The Bat! and fill one of them with spam mail and the other with your non-spam mail. Throw out all encrypted messages (or decrypt them first). Then compress both folders.

Open a folder in baesyan.exe by pressing "Open a base". If the file is OK, the "Parse mailbase" button will be enabled. Press it. The process of building the frequency dictionary for the whole mailbase will begin. When the whole base has been parsed, you will see the number of parsed letters and information about the current dictionary. Now, if you want, you can open another base and parse it too; all results will be accumulated in the current dictionary.
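
To make the accumulation step more concrete, here is a minimal Python sketch. It is not the code of baesyan.exe: the token pattern, the function names tokenize_message and parse_mailbase, and the choice to tokenize only header values are all assumptions made for illustration. It skips HTML, attachments and encodings and only shows how header tokens get the "_h " prefix and how several parsed bases add up in one dictionary.

```python
import re
from collections import Counter
from email import message_from_string

# Assumed token definition: word characters plus $, ' and -.
TOKEN_RE = re.compile(r"[\w$'-]+")


def tokenize_message(raw):
    """Split one raw letter into header tokens ("_h " prefix) and body tokens."""
    msg = message_from_string(raw)
    tokens = []
    for _name, value in msg.items():
        tokens += ["_h " + t for t in TOKEN_RE.findall(value)]
    body = msg.get_payload()
    # Multipart letters return a list here; this sketch handles plain bodies only.
    if isinstance(body, str):
        tokens += TOKEN_RE.findall(body)
    return tokens


def parse_mailbase(messages, dictionary=None):
    """Add the token counts of every letter to the current dictionary."""
    dictionary = Counter() if dictionary is None else dictionary
    for raw in messages:
        dictionary.update(tokenize_message(raw))
    return dictionary
```

Calling parse_mailbase again with the dictionary it returned plays the role of opening and parsing a second base: the new counts are added to the existing ones.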
If you want, you can view the current dictionary by pressing "Show dict", but be aware that it can take a long time, especially if many letters have been parsed. Finally, you can (and must) save the current dictionary by pressing "Save dictionary". You can also open a previously saved dictionary with "Read dictionary".

When the current dictionary is not empty, you can assign it as the "Spam" or "Non-spam" dictionary for building the regarding base. When both of these dictionaries are assigned, you can generate the regarding base by pressing "Yes!".

WARNING! Once you assign the current dictionary as Spam or Non-Spam, it is no longer possible to save it or to continue using it as the current dictionary! So save dictionaries that are not yet ready before experimenting.

When the regarding base has been built, you will see its size in words on the screen. In this version of the machine you must save the regard base immediately by pressing "Save regards". Because of a small error (I apologize for it), only in this case will you get the correct size of the base in the output file. If, for example, you press "Show" or save it once before, then the regard base will be written to the file with the wrong size. Afterwards, if you want, you can generate the regard base again (by pressing "Yes!" again) and display it by pressing "Show". You will see the words and their "probabilities": 0.01 means pure non-spam, 0.99 means pure spam.

The file with the saved regarding base will be used by the filter.

Paul Graham wrote that he has corpora of about 4000 letters each. I don't have that much spam at the moment; so far I have collected only 570. You probably don't either, because spam usually gets deleted... So this is a reason to keep some spam for training.

And a last word. I said that the machine handles Russian "partially transliterated" words. To switch this feature on, you just have to create a new XLT table called "translit" in The Bat!. In this table, just replace some LATIN letters with the corresponding Russian ones (not the other way round!).
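
As a rough picture of what such a table does, here is a small Python sketch. It is not the real XLT data or the code of the machine: the LOOKALIKES table, the function names, and the rule "translate only words that already contain a Cyrillic letter" are assumptions chosen for illustration.

```python
# Hypothetical Latin -> Cyrillic look-alike table (not the real XLT contents).
LOOKALIKES = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435",
    "o": "\u043e", "p": "\u0440", "x": "\u0445",
    "A": "\u0410", "C": "\u0421", "E": "\u0415",
    "O": "\u041e", "P": "\u0420", "X": "\u0425",
})


def has_cyrillic(word):
    """True if the word contains at least one letter from the Cyrillic block."""
    return any("\u0400" <= ch <= "\u04ff" for ch in word)


def normalize_word(word):
    """Replace Latin look-alikes only in words that are already partly Cyrillic."""
    return word.translate(LOOKALIKES) if has_cyrillic(word) else word
```

With this rule a purely English word stays untouched, while a Russian word with a few Latin letters mixed in becomes a normal Russian token before it is counted.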
The machine reads XLT tables directly from The Bat!'s entries in the Windows registry, and if a table with the mime-tag "translit" is found, it will be used for decoding partially transliterated words.

P.S. I also have no ideas for the name of the future filter. Any suggestions?

-- 
Sincerely,
Alexey.
Using TB 1.63b7 on WinXP SP1 Corp + MUI RU, spelling by ORFO2002
mailto:[EMAIL PROTECTED]

________________________________________________
Current version is 1.62 | "Using TBDEV" information:
http://www.silverstones.com/thebat/TBUDLInfo.html