Stevan Bajić wrote:
> On Fri, 18 Dec 2009 03:45:48 +0100
> Frantisek Hanzlik <[email protected]> wrote:
>
>> Stevan Bajić wrote:
>>> On Fri, 18 Dec 2009 00:58:04 +0100
>>> Frantisek Hanzlik <[email protected]> wrote:
>>>
>>>> Stevan Bajić wrote:
>>>>> On Thu, 17 Dec 2009 18:28:58 +0100
>>>>> Frantisek Hanzlik <[email protected]> wrote:
>>>>>
>>>>>> I want to upgrade several DSPAM installations, all of them using the
>>>>>> hash driver, to 3.9.0. Are there any suggestions? Is it possible to
>>>>>> use the old databases, or is that not recommended?
>>>>>>
>>>>> You can use old databases without issues.
>>>>>
>>>>>
>>>>>> Maybe, because of the different (better) charset decoding (important
>>>>>> for me, as Czech uses utf8, 8859-2, cp1250, ... encodings) and HTML
>>>>>> parsing in 3.9.0, it is better to throw away the old databases and
>>>>>> create new ones, probably using corpus training?
>>>>>>
>>>>> Since you are using the Hash driver any training you would want to do
>>>>> can only be on a per user basis since the Hash driver does not have
>>>>> DSPAM-groups support.
>>>>
>>>> Hello Stevan,
>>>>
>>> Ahoi Frantisek,
>>>
>>>> how should I understand this (Hash driver does not have DSPAM-groups
>>>> support)?
>>>>
>>> Semi correct. Everything that involves reading more than one database/css
>>> does not work with the Hash driver.
>>
>> Aha. Then with the hash driver it probably isn't possible to use merged and
>> classification groups and maybe the inoculation group, but shared should
>> be fine.
>>
> Correct.
>
>
>>
>>>
>>>> README says that the hash driver does not support merged groups, but the
>>>> others are probably OK, yes?
>>>>
>>> I need to look deeper into the code but as far as I remember anything that
>>> involves reading more than just one database/css file does not work.
>>>
>>>
>>>> In my configurations I mainly use "shared,managed" or
>>>> "shared" groups and they work fine.
>>>>
>>> Shared is just using ONE single css file for a bunch of users. That should
>>> work with the Hash driver.
>>>
>>>
>>>> Or isn't it possible to use the dspam_train script for DSPAM pre-training?
>>>>
>>> Yes, yes. It is possible to use the dspam_train script to pretrain the Hash
>>> driver.
>>>
>>>
>>>> And in the dspam sources there is a scripts/train.pl script; what is its
>>>> purpose?
>>>>
>>> That is an older version of dspam_train that is far, far, far behind the
>>> current dspam_train in terms of functionality and in terms of used DSPAM
>>> functions (for example it does not handle blocklist, blacklist, etc). You
>>> can use that script if you want or use dspam_train or make your own
>>> training script. I for example use my own script that uses TONE (Train on
>>> Error or Near Error) with additional features like asymmetric
>>> threshold/thickness for the spam/ham training, double side training (this
>>> is essential for the Hyperspace classifier in CRM114 and I find that idea
>>> good so I implemented it into my training script as well), etc... Most of
>>> the ideas about how to train the correct way came up after using
>>> CRM114/OSBF-Lua for many years. My script is also several times faster
>>> than the original dspam_train since I don't use signature based training
>>> (so I don't need to purge signatures after a long training run) and other
>>> small things that I need because I use the script to feed fresh data to my
>>> DSPAM instance that I have captured on my SPAM honeypot.
>>> I needed that additional functionality because all training is done
>>> automatically without my own intervention and I need the script to be rock
>>> solid and to continue running even if some mails are producing errors in
>>> DSPAM while doing the training.
>>> Currently I have the following options:
>>> ----------------------------------------------------------------
>>> theia spam-stuff # ./dspam_train_tone_v5 --help
>>> ERROR: spam corpus must be path to maildir directory or MBOX file.
>>>
>>> Usage: ./dspam_train_tone_v5
>>>    [[username]|[--user username]]   User name to use for training
>>>    [--client]                       To run in client mode
>>>    [--random]                       Randomly process corpi
>>>    [--refute]                       To unlearn errors from opposite class
>>>    [--subject]                      To show subject from error/unlearn/TONE
>>>    [--max-retrain max_retrain]      Maximum relearns per error/TONE
>>>    [--spam-threshold threshold]     TONE Spam threshold
>>>    [--ham-threshold threshold]      TONE Ham threshold
>>>    [--overleap count]               Overleap certain count of messages
>>>    [--stop-after count]             Stop after processed certain count of
>>>                                     messages
>>>    [[-i index]|[spam_dir] [nonspam_dir]]
>>>
>>> theia spam-stuff #
>>> ----------------------------------------------------------------
>>
>> Eh, I must admit, I do not understand all of this fine theory well.
>>
> Sorry. I assumed you understand that.
>
>
>>>>> I would say that you should keep the old databases and run the clean
>>>>> process (cssclean/csscompress) daily to purge old tokens from the
>>>>> database. Sooner or later the old unused tokens will vanish from the
>>>>> database and you will only have new tokens.
>>>>>
>>>>> As soon as you use 3.9.0 your users will benefit from the different
>>>>> (better) charset decoding and html parsing. Purging/removing the
>>>>> database will not affect that capability in any negative nor in any
>>>>> positive way.
>>>>>
>>>>
>>>> Well, I understand. I wanted to try pre-training dspam from a prepared
>>>> spam and ham corpus, as I expect slightly better accuracy in addition to
>>>> starting with a 3.9.0-fine CSS, especially for lazy users who do not
>>>> train dspam properly.
>>>>
>>> Then you should definitely use TOE or TUM but NOT TEFT. I mean in
>>> production. For training you can use whatever you think is best for you.
>>
>> Yes, after some training I commonly switch to TOE. README suggests it too,
>> when database cleanups are being done.
>>
> Do that switch BEFORE doing database cleanups. Then do the cleanup.
>
>
>>>> Sorry for my terrible english.
>>>>
>>> Není žádný problém [Czech: no problem at all]
>>
>> Yes, I know, not for you, but for me it is. But at least so.
>> While we are on the subject: I do not know whether you have noticed it,
>> but the day before yesterday I sent a Czech webui translation via the
>> bugtracker system.
>>
> I have seen it. I wanted to translate the Czech characters to be in HTML
> Unicode encoding but then gave up because of lack of time.
>
> Would be nice if you could do that. Stuff like:
> ř -> &#x159;
> Ř -> &#x158;
> Ž -> &#x17D;
> á -> &#xE1; or -> &aacute;
> ý -> &#xFD;
> é -> &#xE9; or -> &eacute;
> ů -> &#x16F;
>
> etc...
>
> That would make maintaining the HTML files easier since everyone
> having an ASCII capable editor could edit the files without destroying
> anything.
>
>
>> What is untranslated (besides
>> some shortcuts in nav_admin_user.html,
>>
> It's more than that. You have not translated the legend on the graphs as
> well. Look for example in nav_admin_status.html:
> <img src="./admingraph.cgi?data=$DATA_DAILY$&x_label=Hour+of+the+day" />
>
> That x_label can be translated in Czech as well. Just use + for a space.
>
> In nav_analysis.html you have translated it:
> <img src="./graph.cgi?data=$DATA_DAILY$&x_label=Hodina+dne" />
>
> The reason to use those &#xNNNN; HTML codings is that the Web UI does NOT
> have a character encoding set in the header and depending on what encoding
> one has active the characters could show up totally wrong. So it's better
> to use those &#xNNNN; HTML codings to let the browser know what character
> you want to display. And as said before: It's easier for others to maintain
> the HTML files without risking destroying them.

Agreed. I have just uploaded more complete files, converted to HTML
entities. The conversion was done with the recode program:

    for i in *.html *.pl; do recode --diacritics u8..HTML <"$i" >"C/$i"; done

The result seems to be OK.

>> which are probably better left as is)
>> is the "Tweak -1" button in nav_performance.html. Can you please briefly
>> explain its function?
>>
> Often you correct an error from the command line or in other ways and then
> tweak -1 allows you to change the statistics directly from within the Web
> UI. John has described this the following way:
> Q. What is TWEAK -1?
> A. In the CGI, a button labeled "Tweak -1" exists. If you are anal about
> keeping accurate web stats as I am, you want to make sure that messages
> you forward in that are NOT spam don't get counted against the web stats.
> For example, I forward in virus-ridden emails and the occasional completely
> blank message - neither of which DSPAM is expected to catch. Clicking
> "Tweak -1" for each of these emails I send in corrects the web stats so
> as not to count them against DSPAM's accuracy. That's all it is!

Thanks, Franta

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
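
For anyone without recode at hand, the character-to-entity conversion
discussed in this thread can be sketched in a few lines of Python. This is
only an illustration, not what was actually run on the Web UI templates: the
function name to_html_entities is made up for the example, and it assumes
the input is already decoded text (for example read from a UTF-8 file). It
rewrites every non-ASCII character as a hexadecimal &#xNNNN; reference and
leaves plain ASCII, including existing markup and the & in URLs, untouched.

```python
import html


def to_html_entities(text: str) -> str:
    """Replace each non-ASCII character with a hex numeric reference (&#xNNNN;).

    Hypothetical helper for illustration; ASCII characters pass through as-is.
    """
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


if __name__ == "__main__":
    sample = "Hodina dne: ř Ř Ž"
    encoded = to_html_entities(sample)
    print(encoded)  # Hodina dne: &#x159; &#x158; &#x17D;
    # A browser (or html.unescape) turns the references back into characters.
    assert html.unescape(encoded) == sample
```

If decimal references are acceptable, the standard library can do the same
job with text.encode("ascii", "xmlcharrefreplace"), which yields &#345;
instead of &#x159; for ř.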
