Re: Spamassassin uses bayes, but spamd doesn't

2016-06-16 Thread Yu Qian
You can run spamd -D and check the log to see exactly which Bayes db path
your spamd is using.

---
Yu Qian
Ottawa Ontario
Phone: (514)-553-0198



On Thu, Jun 16, 2016 at 7:03 PM, Reindl Harald <h.rei...@thelounge.net>
wrote:

>
>
> On 16.06.2016 at 19:46, Sebastian Arcus wrote:
>
>> I have a particular server running spamd which uses bayes every time I
>> test it by hand, but apparently never when it goes through exim/spamd
>>
>
> Then you need to run it as the correct user, or train Bayes as the correct
> user (the one spamd runs as when exim calls it).
>
>


Re: SA bayes file db permission issue

2016-06-09 Thread Yu Qian
Good point, David. I will try what you suggested; that makes more sense.

On Thu, Jun 9, 2016 at 5:01 PM, David B Funk <dbf...@engineering.uiowa.edu>
wrote:

> On Thu, 9 Jun 2016, Yu Qian wrote:
>
>> Yes, I am sure the path is correct. Also, if the path were not correct, it
>> would show 'db not present'.
>> I tried writing a small Perl script to open the db file, and it failed too,
>> so I think the file may have been damaged during mounting, but I
>> don't know why this would happen.
>>
>> On Thu, Jun 9, 2016 at 4:24 PM, John Hardin <jhar...@impsec.org> wrote:
>>   On Thu, 9 Jun 2016, Yu Qian wrote:
>>
>> My SpamAssassin works pretty well if I run it on a single machine, either
>> Mac or Linux; that is, I update my rules and train my Bayes model on the
>> same machine.
>>
>> But when I trained the model and generated the Bayes db files on the Mac
>> and mounted them into a Docker container, sa-learn failed to read the DB.
>> The permissions look fine, because the error just shows "failed to open
>> bayes_toks".
>>
>> Does anyone know what the problem might be?
>>
>>
> Check the version number of the Berkeley DB libraries on the two different
> machines. There are binary-data compatibility issues between some of the
> versions (e.g. a db file created by v3.0 cannot be opened by v4.2, IIRC).
>
> You may have to do an "sa-learn --backup" on the one system and an
> "sa-learn --restore" on the other.
>
>
> --
> Dave Funk                                  University of Iowa
> College of Engineering
> 319/335-5751   FAX: 319/384-0549           1256 Seamans Center
> Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
> #include 
> Better is not better, 'standard' is better. B{
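
The library-version mismatch suggested in this message can be checked without going through Perl at all. Below is a rough Python sketch that reads the header of a db file and reports what it looks like. It assumes the commonly documented Berkeley DB on-disk layout: releases 2 and later keep a 4-byte magic number at byte offset 12 followed by a 4-byte format version, while the old 1.85 format keeps the magic at offset 0. The function names and the returned tuple are illustrative only; they are not part of SpamAssassin or Berkeley DB.

```python
import struct

# Magic numbers of the Berkeley DB access methods (assumed from the
# documented file format; verify against your installed version).
MAGIC = {0x00061561: "hash", 0x00053162: "btree", 0x00042253: "queue"}


def describe_bdb_header(header):
    """Return (layout, access_method, byte_order, version), or None."""
    # Try the newer layout (magic at offset 12) before the 1.85 one.
    for offset, layout in ((12, "db2+"), (0, "db1.85")):
        if len(header) < offset + 8:
            continue
        # The file may have been written on either byte order.
        for fmt, order in (("<II", "little"), (">II", "big")):
            magic, version = struct.unpack_from(fmt, header, offset)
            if magic in MAGIC:
                return layout, MAGIC[magic], order, version
    return None


def describe_bdb_file(path):
    """Read the first bytes of a db file (e.g. bayes_toks) and describe it."""
    with open(path, "rb") as fh:
        return describe_bdb_header(fh.read(32))
```

Running `describe_bdb_file` on bayes_toks on the Mac and again inside the container, and comparing the results with the library versions installed on each side, should show whether the two ends agree on the format. If they do not, exporting with sa-learn's backup option and re-importing with its restore option avoids the binary format entirely.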


Re: SA bayes file db permission issue

2016-06-09 Thread Yu Qian
OK, I figured it out: db files generated on Mac cannot be used on Linux,
and vice versa.

I think this is related to how the Perl DBM module handles db files on
different systems. I am totally new to Perl.

But it's good to know. Thanks, all.

On Thu, Jun 9, 2016 at 4:38 PM, Alan Hodgson <ahodg...@lists.simkin.ca>
wrote:

> On Thursday 09 June 2016 16:26:26 Yu Qian wrote:
> > Yes, I am sure the path is correct. Also, if the path were not correct, it
> > would show 'db not present'.
> >
> > I tried writing a small Perl script to open the db file, and it failed
> > too, so I think the file may have been damaged during mounting, but I
> > don't know why this would happen.
> >
>
> The docker container probably has a different DB version than your Mac.
>
>


Re: SA bayes file db permission issue

2016-06-09 Thread Yu Qian
Yes, I am sure the path is correct. Also, if the path were not correct, it
would show 'db not present'.

I tried writing a small Perl script to open the db file, and it failed too,
so I think the file may have been damaged during mounting, but I don't know
why this would happen.

On Thu, Jun 9, 2016 at 4:24 PM, John Hardin <jhar...@impsec.org> wrote:

> On Thu, 9 Jun 2016, Yu Qian wrote:
>
>> My SpamAssassin works pretty well if I run it on a single machine, either
>> Mac or Linux; that is, I update my rules and train my Bayes model on the
>> same machine.
>>
>> But when I trained the model and generated the Bayes db files on the Mac
>> and mounted them into a Docker container, sa-learn failed to read the DB.
>> The permissions look fine, because the error just shows "failed to open
>> bayes_toks".
>>
>> Does anyone know what the problem might be?
>>
>
> Are you sure the path is correct?
>
> Run sa-learn in debug mode to see where it's looking for the bayes DB.
>
>
> --
>  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
>  jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
>  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> ---
>   ...wind turbines are not meant to actually be an efficient way to
>   supply the power grid, rather they're prayer wheels for New Age
>   iBuddhists, their whirring blades drawing white guilt from the
>   atmosphere and pumping it safely underground.               -- Tam
> ---
>  170 days since the first successful real return to launch site (SpaceX)
>


Re: SA bayes file db permission issue

2016-06-09 Thread Yu Qian
OK, I think it is simply that the db file cannot be opened by the Perl DBM
module, but I am confused about why it can't be opened.

On Thu, Jun 9, 2016 at 4:11 PM, Yu Qian <jinggeqianyu1...@gmail.com> wrote:

> My SpamAssassin works pretty well if I run it on a single machine, either
> Mac or Linux; that is, I update my rules and train my Bayes model on the
> same machine.
>
> But when I trained the model and generated the Bayes db files on the Mac
> and mounted them into a Docker container, sa-learn failed to read the DB.
> The permissions look fine, because the error just shows "failed to open
> bayes_toks".
>
> Does anyone know what the problem might be?
>
> thanks
>
>
>
>


SA bayes file db permission issue

2016-06-09 Thread Yu Qian
My SpamAssassin works pretty well if I run it on a single machine, either
Mac or Linux; that is, I update my rules and train my Bayes model on the
same machine.

But when I trained the model and generated the Bayes db files on the Mac
and mounted them into a Docker container, sa-learn failed to read the DB.
The permissions look fine, because the error just shows "failed to open
bayes_toks".

Does anyone know what the problem might be?

thanks


Re: How does SpamAssassin processing languages other than English

2016-04-13 Thread Yu Qian
Cool, thanks guys. I think I have a good sense of how SpamAssassin works
now. We are working on a spam project, and it's great to have SpamAssassin.

On Wed, Apr 13, 2016 at 8:21 AM, RW <rwmailli...@googlemail.com> wrote:

> On Tue, 12 Apr 2016 14:15:50 -0400
> Dianne Skoll wrote:
>
> > On Tue, 12 Apr 2016 13:41:51 -0400
> > Yu Qian <jinggeqianyu1...@gmail.com> wrote:
> >
> > > Yup, that's right: it becomes difficult if we want to support
> > > multiple languages in one spam detection solution. It's true
> > > that there are some best practices for a single language, but I
> > > didn't see much support for multiple languages.
> >
> > The only practical approach is to normalize everything into Unicode
> > and tokenize Unicode characters.  (We actually use UTF-8 as the
> > on-disk representation.)
> >
> > We have a custom Bayes engine that treats any character in the CJK
> > Unified Ideographs range as a word.  This is not strictly correct
> > because there are two-character (and longer) CJK words, but it's close
> > enough,
>
> What happens in mainstream SpamAssassin is that if a word is over 15
> bytes long then 3 and 4 byte UTF-8 characters are extracted as tokens in
> place of the original word. Everything can be normalized to UTF-8 with
> "normalize_charset 1"
>
> This will likely work fairly well for CJK, but won't work well for any 3
> or 4 byte UTF-8  alphabet that isn't composed of ideograms (unless
> it's only in spam). This includes most Asian and African languages.
>
> I think the best solution to this is simply to retain the original
> long-word as a token - or to allow it as an option.
>
> Setting normalize_charset also helps with custom rules if you edit them
> as UTF-8, but it's important to remember that SA sees a multibyte
> character as a sequence of bytes rather than a single character. For
> example, you can't put a non-ASCII character between square brackets.
>
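
The long-word behaviour described in this message can be illustrated roughly as follows. This is only a Python sketch of the description given here, not SpamAssassin's actual tokenizer: the function name, the whitespace splitting and the `MAX_TOKEN_BYTES` constant are assumptions made for illustration.

```python
MAX_TOKEN_BYTES = 15  # threshold taken from the description in the message


def long_word_tokens(text):
    """Split on whitespace; replace over-long words by their 3/4-byte chars."""
    tokens = []
    for word in text.split():
        if len(word.encode("utf-8")) <= MAX_TOKEN_BYTES:
            tokens.append(word)
            continue
        # Long word: keep only the characters whose UTF-8 encoding is
        # 3 or 4 bytes long, each as its own token. A long ASCII-only
        # word yields nothing in this sketch, which may differ from
        # what SpamAssassin really does with such words.
        tokens.extend(ch for ch in word if len(ch.encode("utf-8")) >= 3)
    return tokens
```

With this sketch a run of Chinese text without spaces becomes one token per ideograph, which is roughly word-like. A long Thai word is also cut into single letters and combining marks, which carry little meaning on their own; that is the weakness for non-ideographic scripts pointed out in the message.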


Re: How does SpamAssassin processing languages other than English

2016-04-12 Thread Yu Qian
It's nice to hear that SpamAssassin can look at word pairs. As I am new to
SpamAssassin, I am still trying to find out more interesting things about it.

Regarding word pairs: can SpamAssassin detect a word that has been split by
spaces, for example "Free" appearing in an email as "F R E E"?


On Tue, Apr 12, 2016 at 2:15 PM, Dianne Skoll <d...@roaringpenguin.com>
wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian <jinggeqianyu1...@gmail.com> wrote:
>
> > Yup, that's right: it becomes difficult if we want to support multiple
> > languages in one spam detection solution. It's true that there are
> > some best practices for a single language, but I didn't see much
> > support for multiple languages.
>
> The only practical approach is to normalize everything into Unicode and
> tokenize Unicode characters.  (We actually use UTF-8 as the on-disk
> representation.)
>
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word.  This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough, especially because our Bayes engine also looks at word pairs.
>
> I think this is a Summer of Code project for SpamAssassin. :)
>
> Regards,
>
> Dianne.
>
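
A minimal Python sketch of the approach described in this message: every character in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is treated as a word, other text is split into runs of word characters, and adjacent tokens are additionally emitted as pairs. This is not the custom engine mentioned in the message; the function names and the choice of the basic block only (no extension blocks) are assumptions.

```python
import re

# Either one CJK Unified Ideograph, or a run of non-CJK word characters.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def cjk_tokens(text):
    """Lower-case the text and split it into single ideographs and words."""
    return TOKEN_RE.findall(text.lower())


def with_pairs(tokens):
    """Return the tokens followed by every adjacent pair joined by a space."""
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs
```

The pairs partly make up for splitting two-character CJK words: a word such as 免费 is tokenized as 免 and 费, but the pair "免 费" still reaches the classifier as a feature of its own.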


Re: How does SpamAssassin processing languages other than English

2016-04-12 Thread Yu Qian
Yup, that's right: it becomes difficult if we want to support multiple
languages in one spam detection solution. It's true that there are some
best practices for a single language, but I didn't see much support for
multiple languages.

On Tue, Apr 12, 2016 at 1:38 PM, Reindl Harald <h.rei...@thelounge.net>
wrote:

> STAY ON LIST
>
> On 12.04.2016 at 19:22, Yu Qian wrote:
>
>> Yes, right. What I am interested in is that Chinese is different, so
>> does SpamAssassin have a strong tokenizer for it, or does it just use
>> the same tokenizer?
>>
>> On Tue, Apr 12, 2016 at 1:16 PM, Reindl Harald <h.rei...@thelounge.net>
>> wrote:
>>
>>
>>
>> On 12.04.2016 at 18:44, Yu Qian wrote:
>>
>>> SpamAssassin uses Bayes as its classifier, which is typical and
>>> efficient for English. But how does it process other languages, such
>>> as Asian languages?
>>>
>>> Can anyone explain that, or show the code where SpamAssassin does it?
>>
>>
>> Bayes is by definition language agnostic.
>>
>> *You train* Bayes with samples of ham and spam (at least a few
>> hundred of each); the tokenizer splits the messages into parts and
>> builds a database of how often each word appears in spam and in ham
>> (a simplified explanation).
>>
>>
>>
>>
>>
> --
>
> Reindl Harald
> the lounge interactive design GmbH
> A-1060 Vienna, Hofmühlgasse 17
> CTO / CISO / Software-Development
> m: +43 (676) 40 221 40, p: +43 (1) 595 3999 33
> icq: 154546673, http://www.thelounge.net/
>
> http://www.thelounge.net/signature.asc.what.htm
>
>
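
The training step described in this message (counting how often each token appears in ham and in spam) can be sketched in a few lines of Python. This is a simplified illustration only: the class name, the whitespace tokenization and the add-one smoothing are assumptions, and SpamAssassin's real tokenizer and probability combining are more elaborate.

```python
from collections import Counter


class TinyBayes:
    """Toy token counter in the spirit of the description above."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, label, text):
        """Record one training message under label 'spam' or 'ham'."""
        self.messages[label] += 1
        # Count each distinct token once per message.
        self.counts[label].update(set(text.lower().split()))

    def spamminess(self, token):
        """Smoothed estimate between 0 and 1 of how spammy a token is."""
        s = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
        h = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
        return s / (s + h)
```

Nothing in the counting depends on the language: tokens are opaque strings, so the same code works for any script once the tokenizer has produced them. How the text is split into tokens is where languages differ.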


How does SpamAssassin processing languages other than English

2016-04-12 Thread Yu Qian
SpamAssassin uses Bayes as its classifier, which is typical and efficient
for English. But how does it process other languages, such as Asian
languages?

Can anyone explain that, or show the code where SpamAssassin does it?

Thanks, guys!