Re: [Dspam-user] how about upgrading hash DB to 3.9.0 ?

Stevan Bajić Thu, 17 Dec 2009 19:25:10 -0800

On Fri, 18 Dec 2009 03:45:48 +0100
Frantisek Hanzlik <[email protected]> wrote:


> Stevan Bajić wrote:
> > On Fri, 18 Dec 2009 00:58:04 +0100
> > Frantisek Hanzlik<[email protected]>  wrote:
> >
> >> Stevan Bajić wrote:
> >>> On Thu, 17 Dec 2009 18:28:58 +0100
> >>> Frantisek Hanzlik<[email protected]>   wrote:
> >>>
> >>>> I want to upgrade several DSPAM installations, all of which use the hash
> >>>> driver, to 3.9.0. Are there any suggestions? Is it possible to use the old
> >>>> databases, or is that not recommended?
> >>>>
> >>> You can use old databases without issues.
> >>>
> >>>
> >>>> Or, because of the different (better) charset decoding (important for
> >>>> me, since Czech mail uses UTF-8, ISO 8859-2, cp1250, ... encodings) and
> >>>> HTML parsing in 3.9.0, is it better to throw away the old databases and
> >>>> create new ones, probably using corpus training?
> >>>>
> >>> Since you are using the Hash driver any training you would want to do
> >>> can only be on a per user basis since the Hash driver does not have
> >>> DSPAM-groups support.
> >>
> >> Hello Stevan,
> >>
> > Ahoi Frantisek,
> >
> >> how should I understand this (Hash driver does not have DSPAM-groups
> >> support)?
> >>
> > Semi correct. Everything that involves reading more than one database/css
> > does not work with the Hash driver.
> 
> Aha. Then with the hash driver it probably isn't possible to use merged and
> classification groups, and maybe inoculation groups, but shared should be fine.
> 
Correct.


> 
> >
> >> The README says that the hash driver does not support merged groups, but
> >> the others are probably OK, yes?
> >>
> > I need to look deeper into the code but as far as I remember anything that
> > involves reading more than just one database/css file does not work.
> >
> >
> >> In my configurations I mainly use "shared,managed" or
> >> "shared" groups and they work fine.
> >>
> > Shared is just using ONE single css file for a bunch of users. That should 
> > work with the Hash driver.
> >
> >
> >> Or isn't it possible to use the dspam_train script for DSPAM pre-training?
> >>
> > Yes, yes. It is possible to use the dspam_train script to pretrain the Hash 
> > driver.
> >
> >
> >> And in the DSPAM sources there is a scripts/train.pl script; what is its purpose?
> >>
> > That is an older version of dspam_train that is far, far, far behind the
> > current dspam_train in terms of functionality and in terms of the DSPAM
> > functions it uses (for example it does not handle blocklist, blacklist,
> > etc). You can use that script if you want, or use dspam_train, or make
> > your own training script.
> >
> > I, for example, use my own script that uses TONE (Train on Error or Near
> > Error) with additional features like asymmetric threshold/thickness for
> > the spam/ham training, double-sided training (this is essential for the
> > Hyperspace classifier in CRM114 and I find that idea good, so I
> > implemented it in my training script as well), etc... Most of the ideas
> > about how to train the correct way came up after using CRM114/OSBF-Lua
> > for many years.
> >
> > My script is also several times faster than the original dspam_train
> > since I don't use signature-based training (so I don't need to purge
> > signatures after a long training run), and it has other small things that
> > I need because I use the script to feed my DSPAM instance fresh data
> > captured on my spam honeypot. I needed that additional functionality
> > because all training is done automatically without my intervention, and
> > I need the script to be rock solid and to continue running even if some
> > mails produce errors in DSPAM during training.
> >
> > Currently I have the following options:
> > ----------------------------------------------------------------
> > theia spam-stuff # ./dspam_train_tone_v5 --help
> > ERROR: spam corpus must be path to maildir directory or MBOX file.
> >
> > Usage: ./dspam_train_tone_v5
> >    [[username]|[--user username]] User name to use for training
> >    [--client]                     To run in client mode
> >    [--random]                     Randomly process corpi
> >    [--refute]                     To unlearn errors from opposite class
> >    [--subject]                    To show subject from error/unlearn/TONE
> >    [--max-retrain max_retrain]    Maximum relearns per error/TONE
> >    [--spam-threshold threshold]   TONE Spam threshold
> >    [--ham-threshold threshold]    TONE Ham threshold
> >    [--overleap count]             Overleap certain count of messages
> >    [--stop-after count]           Stop after processed certain count of messages
> >    [[-i index]|[spam_dir] [nonspam_dir]]
> >
> > theia spam-stuff #
> > ----------------------------------------------------------------
> 
> Eh, I must admit, I don't understand all of this finer theory well.
> 
Sorry. I assumed you understood that.
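
A minimal sketch of the TONE idea (Train on Error or Near Error) with asymmetric spam/ham thresholds, in Python. This is NOT the dspam_train_tone_v5 script and not DSPAM code: ToyClassifier, tone_train and the default threshold values are hypothetical, and the meaning given here to the spam threshold, ham threshold and max-retrain settings is an assumption based only on the option names in the help output above. A message is learned only when the score is on the wrong side (error) or not yet confidently on the right side (near error).

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real filter; not DSPAM's algorithm."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means undecided."""
        s = h = 1.0  # start both sides at 1 so an empty model scores 0.5
        for tok in text.lower().split():
            s += self.spam[tok]
            h += self.ham[tok]
        return s / (s + h)

    def learn(self, text, is_spam):
        """Add the message's tokens to the spam or ham counts."""
        target = self.spam if is_spam else self.ham
        target.update(text.lower().split())


def tone_train(clf, corpus, spam_threshold=0.7, ham_threshold=0.4, max_retrain=3):
    """Learn a message only on error or near error; return the learn() count."""
    trained = 0
    for text, is_spam in corpus:
        for _ in range(max_retrain):
            score = clf.score(text)
            # Asymmetric thresholds: spam must score >= spam_threshold,
            # ham must score <= ham_threshold, otherwise we train again.
            confident = score >= spam_threshold if is_spam else score <= ham_threshold
            if confident:
                break
            clf.learn(text, is_spam)
            trained += 1
    return trained


if __name__ == "__main__":
    corpus = [("cheap pills now", True), ("meeting notes attached", False)]
    clf = ToyClassifier()
    print("learned on first pass:", tone_train(clf, corpus))
    print("learned on second pass:", tone_train(clf, corpus))
```

On a second pass over the same corpus nothing is learned again, because every message is already classified with enough margin. That is the point of TONE compared to training on everything.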


> >>> I would say that you should keep the old databases and run the clean
> >>> process (cssclean/csscompress) daily to purge old tokens from the
> >>> database. Sooner or later the old unused tokens will vanish from the
> >>> database and you will only have new tokens.
> >>>
> >>> As soon as you use 3.9.0 your users will benefit from the different
> >>> (better) charset decoding and HTML parsing. Purging/removing the database
> >>> will not affect that capability in any way, negative or positive.
> >>>
> >>
> >> Well, I understand. I wanted to try pre-training DSPAM from a prepared spam
> >> and ham corpus, as I expect slightly better accuracy, in addition to starting
> >> with a CSS file built by 3.9.0, especially for lazy users who do not train
> >> DSPAM properly.
> >>
> > Then you should definitely use TOE or TUM but NOT TEFT. I mean in
> > production. For training you can use whatever you think is best for you.
> 
> Yes, after some training I usually switch to TOE. The README suggests it too,
> when database cleanups are being done.
> 
Do that switch BEFORE doing database cleanups. Then do the cleanup.


> >> Sorry for my terrible English.
> >>
> > Není žádný problém (No problem at all)
> 
> Yes, I know, not for you, but for me it is. But at least this much works.
> While we are on the subject: I don't know whether you noticed it, but the day
> before yesterday I sent a Czech Web UI translation via the bug tracker system.
>
I have seen it. I wanted to convert the Czech characters to HTML numeric
character references (Unicode code points) but then gave up because of lack of time.

Would be nice if you could do that. Stuff like:
ř -> &#x159;
Ř -> &#x158;
Ž -> &#x17D;
á -> &#xE1; or -> &aacute;
ý -> &#xFD;
é -> &#xE9; or -> &eacute;
ů -> &#x16F;

etc...

That would make maintaining the HTML files easier since everyone with an
ASCII-capable editor could edit the files without destroying anything.
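
The conversion does not have to be done by hand. A minimal sketch in Python, assuming the template text has already been read into a str with the correct source encoding; the function name to_numeric_refs is made up for this example. It leaves ASCII untouched and replaces every other character with its hexadecimal reference:

```python
def to_numeric_refs(text):
    """Replace each non-ASCII character with a hex HTML character reference."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch))
        for ch in text
    )


if __name__ == "__main__":
    # ASCII letters pass through, Czech letters become &#x...; references.
    print(to_numeric_refs("Hodina dne: ř Ř Ž ů"))
```

Python's built-in text.encode("ascii", "xmlcharrefreplace") does the same job but emits decimal references (&#345; instead of &#x159;), which browsers treat the same way.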


> What is untranslated (besides
> some shortcuts in nav_admin_user.html,
>
It's more than that. You have not translated the legend on the graphs either.
Look for example in nav_admin_status.html:
<img src="./admingraph.cgi?data=$DATA_DAILY$&x_label=Hour+of+the+day" />

That x_label can be translated into Czech as well. Just use + for a space.

In nav_analysis.html you have translated it:
<img src="./graph.cgi?data=$DATA_DAILY$&x_label=Hodina+dne" />
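
If you want to generate such a label instead of typing the + signs by hand, a small Python sketch follows. The helper name x_label_param is made up for this example; quote_plus is from the Python standard library. Note that quote_plus percent-encodes non-ASCII letters as UTF-8 bytes, and whether graph.cgi decodes that correctly is an assumption that has not been checked here, so plain ASCII labels are the safe case.

```python
from urllib.parse import quote_plus


def x_label_param(label):
    """Build the x_label query parameter, encoding spaces as '+'."""
    return "x_label=" + quote_plus(label)


if __name__ == "__main__":
    print(x_label_param("Hour of the day"))
    print(x_label_param("Hodina dne"))
```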

The reason to use those &#xNNNN; numeric character references is that the Web UI
does NOT set a character encoding in the header, and depending on what encoding
one has active the characters could show up totally wrong. So it's better to use
those &#xNNNN; references to tell the browser exactly which character you want to
display. And as said before: it's easier for others to maintain the HTML files
without risking destroying them.


> which are probably better left as is)
> is the button "Tweak -1" in nav_performance.html. Can you please briefly explain
> its function?
> 
Often you correct an error from the command line or in other ways, and then
"Tweak -1" allows you to change the statistics directly from within the Web UI.
John has described this the following way:
Q. What is TWEAK -1?
A. In the CGI, a button labeled "Tweak -1" exists. If you are anal about 
keeping accurate web stats as I am, you want to make sure that messages you 
forward in that are NOT spam don't get counted against the web stats. For 
example, I forward in virus-ridden emails and the occasional completely blank 
message - neither of which DSPAM is expected to catch. Clicking "Tweak -1" for 
each of these emails I send in corrects the web stats so as not to count them 
against DSPAM's accuracy. That's all it is! 




> Thanks, Franta
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user