Re: [Dspam-user] how about upgrading hash DB to 3.9.0 ?

Frantisek Hanzlik Thu, 17 Dec 2009 22:14:07 -0800

Stevan Bajić wrote:
> On Fri, 18 Dec 2009 03:45:48 +0100
> Frantisek Hanzlik<[email protected]>  wrote:
>
>> Stevan Bajić wrote:
>>> On Fri, 18 Dec 2009 00:58:04 +0100
>>> Frantisek Hanzlik<[email protected]>   wrote:
>>>
>>>> Stevan Bajić wrote:
>>>>> On Thu, 17 Dec 2009 18:28:58 +0100
>>>>> Frantisek Hanzlik<[email protected]>    wrote:
>>>>>
>>>>>> I want to upgrade several DSPAM installations, all of which use the
>>>>>> hash driver, to 3.9.0. Any suggestions? Can the old databases be kept,
>>>>>> or is that not recommended?
>>>>>>
>>>>> You can use old databases without issues.
>>>>>
>>>>>
>>>>>> Or maybe, because of the different (better) charset decoding (important
>>>>>> for me, as Czech mail uses utf8, 8859-2, cp1250, ... encodings) and HTML
>>>>>> parsing in 3.9.0, it is better to throw away the old databases and
>>>>>> create new ones, probably with corpus training?
>>>>>>
>>>>> Since you are using the Hash driver any training you would want to do
>>>>> can only be on a per user basis since the Hash driver does not have
>>>>> DSPAM-groups support.
>>>>
>>>> Hello Stevan,
>>>>
>>> Ahoi Frantisek,
>>>
>>>> how should I understand this (Hash driver does not have DSPAM-groups
>>>> support)?
>>>>
>>> Semi correct. Everything that involves reading more than one database/css
>>> does not work with the Hash driver.
>>
>> Aha. Then with the hash driver it probably isn't possible to use merged and
>> classification groups, and maybe inoculation groups, but shared should be fine.
>>
> Correct.
>
>
>>
>>>
>>>> The README says that the hash driver does not support merged groups, but
>>>> the others are probably OK, yes?
>>>>
>>> I need to look deeper into the code but as far as I remember anything that
>>> involves reading more than just one database/css file does not work.
>>>
>>>
>>>> In my configurations I mainly use "shared,managed" or
>>>> "shared" groups and they work fine.
>>>>
>>> Shared is just using ONE single css file for a bunch of users. That should 
>>> work with the Hash driver.
>>>
>>>
>>>> Or isn't it possible to use the dspam_train script for DSPAM pre-training?
>>>>
>>> Yes, yes. It is possible to use the dspam_train script to pretrain the Hash 
>>> driver.
>>>
>>>
>>>> And in the dspam sources there is a scripts/train.pl script; what is its purpose?
>>>>
>>> That is an older version of dspam_train that is far, far, far behind the
>>> current dspam_train in terms of functionality and in terms of used DSPAM
>>> functions (for example it does not handle blocklist, blacklist, etc). You
>>> can use that script if you want or use dspam_train or make your own
>>> training script. I for example use my own script that is using TONE (Train
>>> on Error or Near Error) with additional features like asymmetric
>>> threshold/thickness for the spam/ham training, double-sided training (this
>>> is essential for the Hyperspace classifier in CRM114 and I find that idea
>>> good so I implemented it into my training script as well), etc... Most of
>>> the ideas about how to train the correct way came up after using
>>> CRM114/OSBF-Lua for many years. My script is also several times faster
>>> than the original dspam_train since I don't use signature based training
>>> (so I don't need to purge signatures after a long training run) and other
>>> small things that I need because I use the script to feed fresh data to my
>>> DSPAM instance that I have captured on my SPAM honeypot. I needed that
>>> additional functionality because all training is done automatically
>>> without my intervention and I need the script to be rock solid and to
>>> continue running even if some mails are producing errors in DSPAM while
>>> doing the training.
>>> Currently I have the following options:
>>> ----------------------------------------------------------------
>>> theia spam-stuff # ./dspam_train_tone_v5 --help
>>> ERROR: spam corpus must be path to maildir directory or MBOX file.
>>>
>>> Usage: ./dspam_train_tone_v5
>>>     [[username]|[--user username]] User name to use for training
>>>     [--client]                     To run in client mode
>>>     [--random]                     Randomly process corpi
>>>     [--refute]                     To unlearn errors from opposite class
>>>     [--subject]                    To show subject from error/unlearn/TONE
>>>     [--max-retrain max_retrain]    Maximum relearns per error/TONE
>>>     [--spam-threshold threshold]   TONE Spam threshold
>>>     [--ham-threshold threshold]    TONE Ham threshold
>>>     [--overleap count]             Overleap certain count of messages
>>>     [--stop-after count]           Stop after processed certain count of messages
>>>     [[-i index]|[spam_dir] [nonspam_dir]]
>>>
>>> theia spam-stuff #
>>> ----------------------------------------------------------------
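For readers who have not met TONE before, the decision it describes can be sketched in a few lines of Python. This is only an illustration of the general idea (train on an error, or on a correct result that was not confident enough, with separate thresholds for spam and ham); it is not the dspam_train_tone_v5 script, whose source does not appear in this thread. The function names, the `classify`/`learn` callables, the confidence scale of 0.0 to 1.0 and the default threshold values are all assumptions made for the example.

```python
def needs_training(true_label, predicted_label, confidence,
                   spam_threshold=0.9, ham_threshold=0.7):
    """Decide whether a TONE-style trainer should learn this message.

    Hypothetical helper: labels are "spam" or "ham", and confidence is the
    classifier's confidence in predicted_label, between 0.0 and 1.0.
    """
    for label in (true_label, predicted_label):
        if label not in ("spam", "ham"):
            raise ValueError("label must be 'spam' or 'ham': %r" % (label,))
    # Error: the classifier chose the wrong class, so always train.
    if predicted_label != true_label:
        return True
    # Near error: the class is right but the confidence is below the
    # threshold for that class. Using two thresholds makes it asymmetric.
    threshold = spam_threshold if true_label == "spam" else ham_threshold
    return confidence < threshold


def train_tone(message, true_label, classify, learn, max_retrain=3,
               spam_threshold=0.9, ham_threshold=0.7):
    """Relearn a message until it is classified confidently enough.

    classify(message) must return (predicted_label, confidence) and
    learn(message, label) must train the filter; both are supplied by the
    caller. Returns the number of training rounds that were needed.
    """
    rounds = 0
    while rounds < max_retrain:
        predicted, confidence = classify(message)
        if not needs_training(true_label, predicted, confidence,
                              spam_threshold, ham_threshold):
            break
        learn(message, true_label)
        rounds += 1
    return rounds
```

The `max_retrain` cap plays the role that the --max-retrain option suggests in the help output above: it bounds the number of relearns so that one stubborn message cannot stall an unattended run.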
>>
>> Eh, I must admit I do not understand all of this finer theory well.
>>
> Sorry. I assumed you understand that.
>
>
>>>>> I would say that you should keep the old databases and run the clean
>>>>> process (cssclean/csscompress) daily to purge old tokens from the
>>>>> database. Sooner or later the old unused tokens will vanish from the
>>>>> database and you will only have new tokens.
>>>>>
>>>>> As soon as you use 3.9.0 your users will benefit from the different
>>>>> (better) charset decoding and html parsing. Purging/removing the
>>>>> database will not affect that capability in any negative nor in any
>>>>> positive way.
>>>>>
>>>>
>>>> Well, I understand. I wanted to try pre-training dspam from a prepared
>>>> spam and ham corpus, as I expect slightly better accuracy, in addition
>>>> to starting with a 3.9.0-clean CSS, especially for lazy users who do not
>>>> train dspam properly.
>>>>
>>> Then you should definitely use TOE or TUM but NOT TEFT. I mean in
>>> production. For training you can use whatever you think is best for you.
>>
>> Yes, after some training I commonly switch to TOE. The README suggests it
>> too, when database cleanups are being done.
>>
> Do that switch BEFORE doing database cleanups. Then do the cleanup.
>
>
>>>> Sorry for my terrible english.
>>>>
>>> Není žádný problém (Czech: "It is no problem")
>>
>> Yes, I know, not for you, but for me it is. But at least so.
>> While we are on the subject: I don't know whether you have noticed it, but
>> the day before yesterday I sent a Czech Web UI translation via the bug
>> tracker.
>>
> I have seen it. I wanted to convert the Czech characters to HTML
> Unicode encoding but then gave up because of lack of time.
>
> Would be nice if you could do that. Stuff like:
> ř ->  &#x159;
> Ř ->  &#x158;
> Ž ->  &#x17D;
> á ->  &#xE1; or ->  &aacute;
> ý ->  &#xFD;
> é ->  &#xE9; or ->  &eacute;
> ů ->  &#x16F;
>
> etc...
>
> That would make maintaining the HTML files easier since everyone
> with an ASCII-capable editor could edit the files without destroying
> anything.
>
>
>> What is untranslated (besides
>> some shortcuts in nav_admin_user.html,
>>
> It's more than that. You have not translated the legend on the graphs as 
> well. Look for example in nav_admin_status.html:
> <img src="./admingraph.cgi?data=$DATA_DAILY$&x_label=Hour+of+the+day" />
>
> That x_label can be translated in Czech as well. Just use + for a space.
>
> In nav_analysis.html you have translated it:
> <img src="./graph.cgi?data=$DATA_DAILY$&x_label=Hodina+dne" />
>
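The "+ for a space" rule is ordinary URL query encoding, so a translated label can be produced mechanically. Below is a minimal Python sketch; `graph_src` is a hypothetical helper written for this example and is not part of the DSPAM Web UI, and the `$DATA_DAILY$` placeholder is passed through untouched on the assumption that the template engine fills it in later.

```python
from urllib.parse import quote_plus


def graph_src(script, data_placeholder, x_label):
    """Build the src attribute for a graph image with a translated x_label.

    Hypothetical helper: only x_label is encoded, the data placeholder is
    inserted verbatim.
    """
    return "./{}?data={}&x_label={}".format(
        script, data_placeholder, quote_plus(x_label))


print(graph_src("graph.cgi", "$DATA_DAILY$", "Hodina dne"))
# prints ./graph.cgi?data=$DATA_DAILY$&x_label=Hodina+dne
```

Note that `quote_plus` percent-encodes non-ASCII characters as UTF-8 bytes. Whether the graph CGI renders such labels correctly is not something this thread confirms, so labels with Czech diacritics should be checked in a browser.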
> The reason to use those &#xNNNN; HTML codings is that the Web UI does NOT
> have a character encoding set in the header and depending on what encoding
> one has active the characters could show up totally wrong. So it's better
> to use those &#xNNNN; HTML codes to let the browser know what character you
> want to display. And as said before: It's easier for others to maintain
> the HTML files without risking destroying them.


Agreed. I have just uploaded more complete files, converted to HTML entities.
The conversion was done with the recode program:
for i in *.html *.pl; do recode --diacritics u8..HTML <"$i" >"C/$i"; done

The result seems to be OK.
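For anyone without recode at hand, a similar conversion can be sketched in Python. This is an assumption-laden illustration, not what was run on the uploaded files: `to_html_entities` is a hypothetical name, the input is assumed to be already decoded, precomposed text, and ASCII characters (including `<`, `>` and `&`) are deliberately left alone so that existing markup in the templates survives.

```python
from html.entities import codepoint2name


def to_html_entities(text, prefer_named=False):
    """Replace every non-ASCII character with an HTML character reference.

    With prefer_named=True, characters that have an HTML 4 entity name
    (for example aacute) use it; all others get a hexadecimal reference.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII, including markup characters, stays as it is.
            out.append(ch)
        elif prefer_named and cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#x{:X};".format(cp))
    return "".join(out)


print(to_html_entities("\u0159"))                     # prints &#x159;
print(to_html_entities("\u00e9"))                     # prints &#xE9;
print(to_html_entities("\u00e1", prefer_named=True))  # prints &aacute;
```

If decimal references are acceptable, the standard library can do it in one step with `text.encode("ascii", "xmlcharrefreplace")`, which returns bytes containing references such as `&#345;`.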

>> which are probably better left as they are)
>> is the "Tweak -1" button in nav_performance.html. Can you please briefly
>> explain its function?
>>
> Often you correct an error from the command line or in other ways and then
> Tweak -1 allows you to change the statistics directly from within the Web
> UI. John has described this the following way:
> Q. What is TWEAK -1?
> A. In the CGI, a button labeled "Tweak -1" exists. If you are anal about
> keeping accurate web stats as I am, you want to make sure that messages
> you forward in that are NOT spam don't get counted against the web stats.
> For example, I forward in virus-ridden emails and the occasional completely
> blank message - neither of which DSPAM is expected to catch. Clicking
> "Tweak -1" for each of these emails I send in corrects the web stats so
> as not to count them against DSPAM's accuracy. That's all it is!

Thanks, Franta

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
