On Sun, 28 Oct 2012, Alexandre Boyer wrote:
the blocks on the right side indicate the different corpora (and the user
handles they belong to)
Around 10 corpora. Are those corpora used to run the SA mass-check on SA
servers, or do they also include what I will send one day (my mc logs)?
They include both. The ones with "bb" (like mine) are where the corpora
have been uploaded for central masscheck. The rest are where people have
run local masschecks and only uploaded the results (i.e. which rules hit).
Is there any means of geographical mapping? Or do you think this is
not relevant?
Messages from more than just the USA are very welcome, so that the
standard rules become more usable worldwide.
However, there is no provision for *tracking* the geographical mapping of
the corpora, for dividing rules into geographical subsets, or for running
different masschecks for different geographical subsets of rules and
corpora to provide regional scoring. Until there is provision for
geographical subsets of rules with appropriate scoring, there isn't much
utility to identifying the geographical source of the corpora.
If so, this is a really small and biased corpus (in terms of statistical
analysis).
That can be said of pretty much every individual corpus. Taken as a whole
they should provide a fairly unbiased corpus. Contributions from sources
_not_ in North America can only help achieve this.
Also: ham is *very* important. If you can, provide more ham than you do
spam.
I just can't wait to see my results with my French, German,
Russian and Chinese ham and spam messages
Agreed.
It might be useful to provide publicly-visible copies of your raw spam
corpora so that people working on rules (like me) can look for common
patterns in non-English languages. For example, I occasionally directly
get 419 spams in Spanish or French, which makes me really happy as I can
use them to improve the coverage of the ADVANCE_FEE rules. Being able to
look at similar content in Russian and German would be very welcome.
General comment: anybody who isn't willing to publish their spam corpora
is more than welcome to directly send me non-English samples of 419 spams
and phish (but no Chinese, please - I know my limitations!). Send as
.tar.gz or RFC822 attachments, please.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
...the Fates notice those who buy chainsaws...
-- www.darwinawards.com
-----------------------------------------------------------------------
3 days until Halloween