Re: How does SpamAssassin processing languages other than English
Cool, thanks guys. I think I have a good sense of how SpamAssassin works now. We are working on a spam project, and it's great to have SpamAssassin.

---
Yu Qian
Ottawa Ontario
Phone: (514)-553-0198

On Wed, Apr 13, 2016 at 8:21 AM, RW wrote:

> On Tue, 12 Apr 2016 14:15:50 -0400
> Dianne Skoll wrote:
>
> > On Tue, 12 Apr 2016 13:41:51 -0400
> > Yu Qian wrote:
> >
> > > Yup, that's right, it becomes difficult if we want to support
> > > multiple language in one spam detection solution. and it's true
> > > that there are some best practice for single language. but didn't
> > > see too much support multiple
> >
> > The only practical approach is to normalize everything into Unicode
> > and tokenize Unicode characters. (We actually use UTF-8 as the
> > on-disk representation.)
> >
> > We have a custom Bayes engine that treats any character in the CJK
> > Unified Ideographs range as a word. This is not strictly correct
> > because there are two-character (and longer) CJK words, but it's close
> > enough,
>
> What happens in mainstream SpamAssassin is that if a word is over 15
> bytes long then 3 and 4 byte UTF-8 characters are extracted as tokens in
> place of the original word. Everything can be normalized to UTF-8 with
> "normalize_charset 1"
>
> This will likely work fairly well for CJK, but won't work well for any 3
> or 4 byte UTF-8 alphabet that isn't composed of ideograms (unless
> it's only in spam). This includes most Asian and African languages.
>
> I think the best solution to this is simply to retain the original
> long-word as a token - or to allow it as an option.
>
> Setting normalize_charset also helps with custom rules if you edit them
> as UTF-8, but it's important to remember that SA sees a multibyte
> character as a sequence of bytes rather than a single character. For
> example you can't put a non-ASCII character between square brackets.
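The last point in the quoted message, that a multibyte character is seen as a sequence of bytes, can be illustrated outside SpamAssassin. This is a minimal Python sketch, not SpamAssassin code: Python's `re` module applied to byte strings is only a stand-in for byte-oriented rule matching, and the sample word is an arbitrary choice.

```python
import re

# Stand-in for byte-oriented rule matching; this is not SpamAssassin code.
E_ACUTE = "\u00e9".encode("utf-8")       # 'é' is two bytes in UTF-8: C3 A9
subject = "caf\u00e9".encode("utf-8")    # b'caf\xc3\xa9'

# A bracket expression holding 'é' is really a class of two single bytes,
# so it consumes only one byte and the anchored match fails.
bracket = re.compile(b"caf[" + E_ACUTE + b"]$")
print(bracket.search(subject))           # None

# The same class also accepts unrelated text that shares the lead byte:
# 'Ã' (U+00C3) encodes as C3 83, and C3 is a member of the class.
loose = re.compile(b"caf[" + E_ACUTE + b"]")
print(loose.search("caf\u00c3".encode("utf-8")) is not None)   # True

# Grouping the whole byte sequence matches the character as intended.
grouped = re.compile(b"caf(?:" + E_ACUTE + b")$")
print(grouped.search(subject) is not None)                      # True
```

In a custom rule the equivalent workaround would presumably be alternation or grouping instead of a bracket expression, but that should be checked against the rule engine itself.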
Re: [OT] still configuring [Was: Disabling spamcop plugin]
On 2016-04-13 09:12 -0400, Michael Orlitzky wrote:

> package will be recompiled automatically as part of the updates. Any
> packages *depending on* that package (like, if they're statically linked
> to it) will also be recompiled.

So will _direct_ dependencies of the affected package, if the latest
version has new requirements. And this is the heart of the problem.

With a dedicated security channel like Debian has, the fixes are built
against the base release, so (for example) I'd never have to update
Perl because of a fix in SpamAssassin. In fact, you can leave Debian
servers to update themselves unattended most of the time.

This is too huge a benefit for me to drop, even weighed against the
recent Debian annoyances.

-- 
Please *no* private copies of mailing list or newsgroup messages.
Rule 420: All persons more than eight miles high to leave the court.
Re: Disabling spamcop plugin
On 04/13/2016 09:50 AM, Reindl Harald wrote:
>
> enough problems by wasting time if you have to maintain 10, 20, 30 or
> more servers and in case of problems need fast downgrades - especially
> if you run virtual machines where all the compile jobs share hardware

emerge --buildpkg will create a binary package that you can instantly
downgrade to with emerge --usepkg.

> besides that on a production server no compilers should be installed at
> all - the generation of malware which compiles itself is only a question
> of time

I'm not convinced that an attacker who can execute commands on your
server is more dangerous when one of those commands is `gcc`.

>
> what gentoo would need to solve for professional environments is that
> you have one machine which pulls the updates, compiles them and packages
> them in a way all other machines in the network can pull and apply them
> in precompiled form over ftp, http or whatever network protocol
>

As you wish: https://wiki.gentoo.org/wiki/Binary_package_guide
Re: Disabling spamcop plugin
On Wed Apr 13 15:50:27 2016, Reindl Harald wrote:
> enough problems by wasting time if you have to maintain 10, 20, 30 or more
> servers and in case of problems need fast downgrades - especially if you run
> virtual machines where all the compile jobs share hardware
>
> besides that on a production server no compilers should be installed at all
> - the generation of malware which compiles itself is only a question of time
>
> what gentoo would need to solve for professional environments is that you
> have one machine which pulls the updates, compiles them and packages them in
> a way all other machines in the network can pull and apply them in
> precompiled form over ftp, http or whatever network protocol
>
> we are doing the same even for Fedora servers where one machine which has
> all packages installed moves them from yum/dnf-cache to a repo folder, runs
> createrepo and all other machines have only this repo enabled and so can do
> a "yum -y upgrade" which can be triggered over SSH directly from the admin
> machine with a "distribute-updates.sh" script and its own SSH key for that
> task

Hi,

When you run several dozen servers, you should use an orchestrator.
That way, you don't spend time on each server individually.

You can also have a build host for your Gentoo architecture that serves
binary packages to the other servers.

-- 
alarig
Re: Disabling spamcop plugin
On 13.04.2016 at 15:12, Michael Orlitzky wrote:
> On 04/13/2016 01:26 AM, Ian Zimmerman wrote:
>> On 2016-04-12 10:57 -0400, David Niklas wrote:
>>
>>> You could use Gentoo, you get to configure it all yourself!
>>
>> Funny you'd say that, I _am_ actually switching to it - on my
>> "workstation" role computers. I'm already over 50% over the hump, I
>> think.
>>
>> But on "server type" computers, I just cannot spare a dedicated security
>> branch. I really don't have the time, and more importantly the nerves,
>> to scramble and recompile the world when each new vulnerability is
>> announced.
>
> This shouldn't be worse on Gentoo than it is anywhere else. We have a
> mailing list, gentoo-announce [0], where security advisories get sent.
> But, they only get sent out once the vulnerability has been fixed and
> marked stable /everywhere/, so they often come a little late.
> Nevertheless, security issues are fixed ASAP:
>
> 1. Some vulnerability is found.
> 2. The security team opens a bug, and contacts the maintainer of the
>    affected package.
> 3. A fix is committed to the tree.
> 4. The arch teams scramble to stabilize the version with the fix.
> 5. The announcement is sent out.
>
> As long as you follow a semi-regular update cycle, you shouldn't have to
> do anything special, even if you run a stable system. The affected
> package will be recompiled automatically as part of the updates. Any
> packages *depending on* that package (like, if they're statically linked
> to it) will also be recompiled.
>
> No need to recompile @world

that still wastes enough time to be a problem if you have to maintain
10, 20, 30 or more servers and in case of problems need fast downgrades -
especially if you run virtual machines where all the compile jobs share
hardware

besides that on a production server no compilers should be installed at
all - the generation of malware which compiles itself is only a question
of time

what gentoo would need to solve for professional environments is that
you have one machine which pulls the updates, compiles them and packages
them in a way all other machines in the network can pull and apply them
in precompiled form over ftp, http or whatever network protocol

we are doing the same even for Fedora servers where one machine which
has all packages installed moves them from yum/dnf-cache to a repo
folder, runs createrepo and all other machines have only this repo
enabled and so can do a "yum -y upgrade" which can be triggered over SSH
directly from the admin machine with a "distribute-updates.sh" script
and its own SSH key for that task
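The Fedora setup described above (one machine publishes a repo, and the admin machine triggers "yum -y upgrade" on the others over SSH) might look roughly like the following Python sketch. The distribute-updates.sh script itself is not shown in the thread, so this is only a guess at its shape; the host names, the key path, and the root login are hypothetical placeholders.

```python
import subprocess

# Hypothetical values; the thread does not name hosts or key locations.
HOSTS = ["mail1.example.org", "mail2.example.org"]
KEY_PATH = "/root/.ssh/updates_key"


def upgrade_command(host: str, key_path: str = KEY_PATH) -> list[str]:
    """Build the ssh invocation that runs the upgrade on one host."""
    return ["ssh", "-i", key_path, f"root@{host}", "yum", "-y", "upgrade"]


def distribute_updates(hosts: list[str], dry_run: bool = True) -> dict[str, int]:
    """Run the upgrade on each host and return the exit code per host.

    With dry_run=True nothing is executed; the commands are only printed
    and every host is reported with exit code 0.
    """
    results = {}
    for host in hosts:
        cmd = upgrade_command(host)
        if dry_run:
            print(" ".join(cmd))
            results[host] = 0
        else:
            results[host] = subprocess.run(cmd).returncode
    return results
```

Calling `distribute_updates(HOSTS)` only prints the commands; passing `dry_run=False` would actually run them, so a non-zero exit code per host is the place to hook in alerting or a rollback.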
Re: [OT] still configuring [Was: Disabling spamcop plugin]
On 04/13/2016 01:26 AM, Ian Zimmerman wrote:
> On 2016-04-12 10:57 -0400, David Niklas wrote:
>
>> You could use Gentoo, you get to configure it all yourself!
>
> Funny you'd say that, I _am_ actually switching to it - on my
> "workstation" role computers. I'm already over 50% over the hump, I
> think.
>
> But on "server type" computers, I just cannot spare a dedicated security
> branch. I really don't have the time, and more importantly the nerves,
> to scramble and recompile the world when each new vulnerability is
> announced.
>

This shouldn't be worse on Gentoo than it is anywhere else. We have a
mailing list, gentoo-announce [0], where security advisories get sent.
But, they only get sent out once the vulnerability has been fixed and
marked stable /everywhere/, so they often come a little late.
Nevertheless, security issues are fixed ASAP:

  1. Some vulnerability is found.

  2. The security team opens a bug, and contacts the maintainer of the
     affected package.

  3. A fix is committed to the tree.

  4. The arch teams scramble to stabilize the version with the fix.

  5. The announcement is sent out.

As long as you follow a semi-regular update cycle, you shouldn't have to
do anything special, even if you run a stable system. The affected
package will be recompiled automatically as part of the updates. Any
packages *depending on* that package (like, if they're statically linked
to it) will also be recompiled.

No need to recompile @world.

[0] https://www.gentoo.org/get-involved/mailing-lists/
Re: How does SpamAssassin processing languages other than English
On Tue, 12 Apr 2016 14:15:50 -0400
Dianne Skoll wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian wrote:
>
> > Yup, that's right, it becomes difficult if we want to support
> > multiple language in one spam detection solution. and it's true
> > that there are some best practice for single language. but didn't
> > see too much support multiple
>
> The only practical approach is to normalize everything into Unicode
> and tokenize Unicode characters. (We actually use UTF-8 as the
> on-disk representation.)
>
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word. This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough,

What happens in mainstream SpamAssassin is that if a word is over 15
bytes long, then 3 and 4 byte UTF-8 characters are extracted as tokens in
place of the original word. Everything can be normalized to UTF-8 with
"normalize_charset 1".

This will likely work fairly well for CJK, but won't work well for any 3
or 4 byte UTF-8 alphabet that isn't composed of ideograms (unless
it's only in spam). This includes most Asian and African languages.

I think the best solution to this is simply to retain the original
long word as a token - or to allow it as an option.

Setting normalize_charset also helps with custom rules if you edit them
as UTF-8, but it's important to remember that SA sees a multibyte
character as a sequence of bytes rather than a single character. For
example, you can't put a non-ASCII character between square brackets.
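The two tokenizing approaches discussed above can be sketched in Python. This is a simplified model built only from the descriptions in this thread, not code taken from SpamAssassin or from the custom Bayes engine; the function names and the `keep_original` option (modelling the "retain the original long word" proposal) are made up for illustration.

```python
MAX_WORD_BYTES = 15  # threshold quoted in the message above


def utf8_len(text: str) -> int:
    """Length of the text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def long_word_tokens(word: str, keep_original: bool = False) -> list[str]:
    """Tokens for one word under the long-word rule described above."""
    if utf8_len(word) <= MAX_WORD_BYTES:
        return [word]
    # Long word: emit each 3- or 4-byte UTF-8 character as its own token.
    tokens = [ch for ch in word if utf8_len(ch) >= 3]
    if keep_original:
        # Proposed option: retain the long word itself as a token too.
        tokens.append(word)
    return tokens


def cjk_tokens(text: str) -> list[str]:
    """Treat each CJK Unified Ideograph (U+4E00..U+9FFF) as a word."""
    tokens, current = [], []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff" or ch.isspace():
            # An ideograph or whitespace ends any word being accumulated.
            if current:
                tokens.append("".join(current))
                current = []
            if not ch.isspace():
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

Feeding a ten-character Thai word (30 bytes in UTF-8) to `long_word_tokens` yields ten single-character tokens, including bare combining vowel marks, which is the weakness noted above for scripts that are not ideographic. With `keep_original=True` the whole word survives as one extra token.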