On Sun, 28 Oct 2012, Alexandre Boyer wrote:

The blocks on the right side indicate the different corpora (and the user handles they belong to).

Around 10 corpora. Are those corpora used to run the SA mass-check on SA
servers, or do they also include what I will send one day (my mc logs)?

They include both. The ones with "bb" (like mine) are where the corpora have been uploaded for central masscheck. The rest are where people have run local masschecks and only uploaded the results (i.e. which rules hit).

Is there any means to have a geographical mapping? Or do you think this is not relevant?

Messages from more than just the USA are very welcome, so that the standard rules become more usable worldwide.

However, there is no provision for *tracking* the geographical mapping of the corpora, for dividing rules into geographical subsets, or for running different masschecks for different geographical subsets of rules and corpora to provide regional scoring. Until there is provision for geographical subsets of rules with appropriate scoring, there isn't much utility to identifying the geographical source of the corpora.

If so, this is a really small and biased corpus (in terms of statistical
analysis).

That can be said of pretty much every individual corpus. Taken as a whole they should provide a fairly unbiased corpus. Contributions from sources _not_ in North America can only help achieve this.

Also: ham is *very* important. If you can, provide more ham than you do spam.

I just can't wait to see my results with my French, German,
Russian and Chinese ham and spam messages.

Agreed.

It might be useful to provide publicly visible copies of your raw spam corpora so that people working on rules (like me) can look for common patterns in non-English languages. For example, I occasionally get 419 spams directly in Spanish or French, which makes me really happy as I can use them to improve the coverage of the ADVANCE_FEE rules. Being able to look at similar content in Russian and German would be very welcome.

General comment: anybody who isn't willing to publish their spam corpora is more than welcome to send me non-English samples of 419 spams and phish directly (but no Chinese, please - I know my limitations!). Send as .tar.gz or RFC822 attachments, please.
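
If it helps, here is a minimal sketch of bundling a directory of raw sample messages into a single .tar.gz for attaching. It uses only the Python standard library; the function name `pack_samples` and the paths in the usage note are made up for illustration and are not part of SpamAssassin or the masscheck tooling.

```python
import tarfile
from pathlib import Path


def pack_samples(sample_dir, archive_path):
    """Pack every regular file under sample_dir into a gzip-compressed tar.

    Hypothetical helper: returns the sorted list of member names written.
    """
    sample_dir = Path(sample_dir)
    names = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in sorted(sample_dir.rglob("*")):
            if path.is_file():
                # Store names relative to sample_dir so the archive does not
                # expose the sender's local directory layout.
                arcname = path.relative_to(sample_dir).as_posix()
                tar.add(str(path), arcname=arcname)
                names.append(arcname)
    return names
```

For example, `pack_samples("419-samples", "419-samples.tar.gz")` would archive whatever is under the (hypothetical) `419-samples` directory. Write the archive somewhere outside the directory being packed, otherwise the partially written archive gets picked up as a sample too.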

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  ...the Fates notice those who buy chainsaws...
                                              -- www.darwinawards.com
-----------------------------------------------------------------------
 3 days until Halloween
