Re: French rules
On Fri, 9 Dec 2011, LEVEAU Stanislas wrote: Hi I am looking for French rules with sa-update? I wrote a few some years/months ago and since the feedback was "this is too aggressive" (well, "it works for me"(tm)...) I did not go any further. I do not know where the "official" archives of this list are supposed to be, here is one: http://comments.gmane.org/gmane.mail.spam.spamassassin.general/110439 and the entry point to the rules (guess what ? It's in French...) http://www.saphirtech.com/spamassassin.html I have of course modified my rules since, if anyone is interested I can update a "v2 version", just ask. Sincerely, John GALLET
Re: recommended max children settings
Hi, FWIW: the way I solved it was limiting the amount of concurrently incoming spam. Because my box serves only a few different domains, I limited the number of connections from the same SMTP client to 5 using iptables (the connlimit module). This might or might not be possible for others. This stopped "spam outbursts" dead: your 8000 mails per day are NOT received in a linear way, but every time a spammer sends you a "batch" you just cannot keep up: who would drink from a fire hose ? HTH JGA
Re: sa-learn process overwhelming the server
Hi, processes. It has even, on occasion, necessitated a reboot when i could not get the system to kill the process. I've taken to trying to scan it daily and manually delete the spam, but that's not always possible. This hint might be totally wrong, but last time I saw such a behavior it was linked to the process /usr/libexec/gam_server (a file alteration monitor, used by fail2ban for example) that was (uselessly) triggered by sa-learn. I just configured gamin so it would ignore the user data partition and the heavy loads disappeared. HTH JG
Re: RFC's suck
Hi, [repost from yesterday, I was not using the correct From address for this list...] Yes, it means that every Received: header in an email is valid with a valid IP, valid configuration (whatever that is deemed to be), and valid DNS. Only servers that were correctly classified as mailservers would even be able to be verified. Mailservers that were spam sources would be easily identified and blacklisted across the board. Or something. I am a bit lost here. Are you saying that right now the *main* problem with spam is "source spoofing" and that just by having a strict format for emails in the protocol we would turn the whole spam fighting industry into a single huge database of "known spammers" ?

If it were true, I think it still raises a few issues.

First, not all providers agree to implement SPF currently. One of the reasons is that I am [EDIT: usually !!] using a totally legitimate @wanadoo.fr address while I am not currently connected through their network. So basically all providers would have to issue SSL certificates for all their clients, which could of course be stolen by malware etc. and become "legit". What I mean is that the very first step of email sending, between my favorite mail reader and the SMTP/IMAP server, would still be a weak point (like we say in French, "l'ennemi, c'est l'utilisateur", i.e. "the end-user is the enemy/the source of all evil"). And I am not even talking about all the email domains that are not ISPs, such as hotmail or yahoo and the like: you can secure all the way from their server to yours, it will never prevent the "garbage in - garbage out" approach.

Second, I am not aware of any lawsuits yet against RBLs, but I am quite surprised no "official spammer" has already done that, or tried direct attacks targeted at the RBL servers: they have enough zombies to spam, so they could. 
Furthermore, let me be the devil's advocate for a second: would you not agree that many a rule in SA can actually catch spam precisely because the messages are RFC compliant but stupid enough to add fancy headers or fancy header formatting ?

> Putting this on a distinct port seems more a marketing thing. Why not
> add it as a capability in a normal SMTP server?

Because the idea is to be able to simply retire the current SMTP and that will be a lot simpler if the new service is on a new port.

I would agree to that. HTTP already does (i.e. port 8080 vs port 80), it would facilitate migration.

A secure verifiable delivery chain from server to server would almost completely eliminate the need for SA.

I cannot agree to that. The point of entry has to be secured and I am afraid it will be a pain to do so.

And I'm not saying it would be easy, or happen over-night. I figure if people started working on an RFC right now we might see the end of the current SMTP in 15-20 years unless there was a huge push in which case it could maybe happen in 10-15 years.

Might be. We have been talking over and over about IPv6 for about 15 years now, and currently it only causes problems between compliant and non-compliant equipment, with zero gain.

I'm not saying it would ELIMINATE Spam, but it would certainly reduce it to a manageable level.

Having an authenticated chain can only help if it is not broken or if we can detect it was broken, otherwise it will have the reverse effect of spammers injecting massive spam into "trusted" network chains that cannot be banned for fear of hitting legit users.

Nothing we're doing now is reducing it at all, the amount of spam has been increasing steadily every year since the very first Green Card posting to USENET.

Amen to that. To come back (a little) to the original post, IMHO we cannot and should not do without specs, i.e. RFCs. The existing ones are not that bad: I sent my first emails back in 1992 with my mom's address, so these RFCs have made the world communicate (for better and worse) for 20 years. Back then, bandwidth was so low and email access so controlled (add to that a tiny bit of optimism about the kindness of human nature) that spam was not an issue. A new RFC may be needed, but I really cannot believe its main improvement would be protocol formatting... HTH JG
Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules
Re,

Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't read this mail in html, click here).

It might be worth collecting more ham that includes any such common text -- or even _generating_ mails along those lines (just edit the message body to include the text you want the ruleset to avoid. ;)

Well, that's the whole point: can we conclude that an email with an unsubscribe link tends to be spam more often than ham ? I consider so, but with a low score. Can we conclude that an email citing the French Law "informatique et libertés" is spam ? I would say "100% except government sponsored mailing lists that may feel obliged to do so", so I added a higher score. Now it might well be faulty logic, I do not have any experience in spam fighting.

I also adapted this one (paths of course, but also forced "mbox" format, "detect" spit out zero results)

ah. forgot to mention: detect only treats files that end in ".mbox" as mboxes. ;)

:-) ok, well anyway it was quite easy to find out since it worked well when forcing and not at all in automatic.

Thanks for trying it out!

Well, thanks for writing it. I think its main weak point for French and other accented languages is handling the different encodings of the same accented character, some kind of "synonyms" list. The same letter, say "a with an accent", can be misspelled with a plain "a", encoded in various charsets (latin, utf-8) to a "normal" à, or html encoded as agrave (I left & and ; out). I do not know if it is possible at all, it might complicate things *a lot*. a++; JG
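A minimal sketch of the "synonyms" idea above, written in Python rather than inside SpamAssassin: decode HTML entities, then strip accents, so that the accented, entity-encoded and plain spellings of a phrase collapse to one form before any pattern is applied. The function name fold_accents and the sample strings are hypothetical; this only illustrates the normalization step and says nothing about how seek-phrases-in-corpus or SpamAssassin handle charsets internally.

```python
import html
import unicodedata


def fold_accents(text: str) -> str:
    """Decode HTML entities, then drop combining accents.

    'à', '&agrave;' and a plain 'a' all end up as 'a'.
    """
    decoded = html.unescape(text)
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", decoded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Three spellings of the same phrase (hypothetical samples).
variants = [
    "Conformément à la loi",
    "Conform&eacute;ment &agrave; la loi",
    "Conformement a la loi",
]

# After folding, all variants collapse to a single lower-cased key.
folded = {fold_accents(v).lower() for v in variants}
```

The trade-off is that folding loses information: a rule written against the folded text can no longer tell "à" from "a", which may or may not matter for spam patterns.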
Re: French advance fee fraud ruleset
In a similar vein to the "Nigerian" advance fee fraud, here's a ruleset for French-language scams, often originating from Côte d'Ivoire. http://www.tradoc.fr/spamassassin/fraude_fr.cf All comments welcome. Thanks, some of these still getting (rarely, but still so) through. cat fraude_fr.cf >>$HOME/.spamassassin/user_prefs and I will keep you posted. They look good (from memory, I have no sample). Just a quick question about: ifplugin Mail::SpamAssassin::Plugin::ReplaceTags replace_tag AGRAVE [a\xC0\xE0] What happens with the agrave htmlentity ? I mean if the received spam is htmlentity encoded, or mixes utf-8 accents and ascii-htmlentity ? JG
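To make the question concrete, here is a hedged Python sketch of what a wider "a grave" alternative could look like if it had to cover the plain letter, the Latin-1 bytes, the UTF-8 byte pairs and the HTML entity at once. The name AGRAVE_WIDE and the test phrase are assumptions for illustration; this is not the content of fraude_fr.cf, and whether SpamAssassin hands entities to body rules decoded or raw is exactly the open question.

```python
import re

# Hypothetical wider alternative for "a grave", matched on raw bytes:
#   [aA\xC0\xE0]    plain a/A, or Latin-1 bytes for À / à
#   \xC3[\x80\xA0]  UTF-8 byte pairs for À / à
#   &[aA]grave;     HTML entity
AGRAVE_WIDE = rb"(?:[aA\xC0\xE0]|\xC3[\x80\xA0]|&[aA]grave;)"

# .{1,8} lets the "é" be one byte (Latin-1 or plain e), two bytes (UTF-8)
# or the eight bytes of "&eacute;".
pattern = re.compile(
    rb"conform.{1,8}ment " + AGRAVE_WIDE + rb" la loi", re.IGNORECASE
)

phrase = "Conformément à la loi"
samples = [
    phrase.encode("latin-1"),
    phrase.encode("utf-8"),
    b"Conform&eacute;ment &agrave; la loi",
    b"Conformement a la loi",
]

hits = [bool(pattern.search(s)) for s in samples]
miss = bool(pattern.search(b"Conformement en la loi"))
```

The pattern grows quickly once every accented letter needs this treatment, which is one reason a tag mechanism such as replace_tag is attractive.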
Philosophy for opt-in (was Re: [Rule Set proposal] French Rules
Hi,

If these are hit rates with a very minimal daily corpus, don't know if the present ruleset is ready for production unless you have 0 tolerance for any bulk, period

I'm afraid I must agree. I don't have a confirmed and sorted corpus per se, but after a single night's live testing with very low scores I can confirm that, as I suspected, many of these rules hit genuine opt-in newsletters and even things like ebay notifications in French.

Thanks for the feedback. I do not have any ebay subscribers among my users, except one power-seller who has ebay thingies in whitelist.

I will however keep the ruleset live for a while, to see whether the online meds and online gambling rules actually hit anything.

They should, they do on my machines. But actually, they are only useful for a "new" spam that has not been caught yet by RBL. When I wrote them, it was because spam *was* getting through, now they just push towards "almost-probably-spam". Another note is that much of this particular spam is automatically and badly translated (much "pidgin-French" if I can say so).

My personal tolerance for bulk mail is pretty low, and in a way I'd love to use rules like these, with just a bit of fine tuning - the rules do also hit a fair bit of French spam. But unfortunately my users actually want to receive their newsletters and even complain if it ends up in their spam folder.

I think I have a simple newbie problem of philosophy/strategy: my approach, for what it's worth, was that I flag anything that contains some unsubscribe links and French law reminders, because anyway all the ones I receive are spam, and I add the opt-in mailings/newsletters I receive to whitelist_from in user_prefs, i.e. I kill everything except those explicitly allowed. If that is not the correct approach, I can guarantee you the current way the rules are written is bad (too harsh), and I need strategy advice on how to manage opt-in lists. John
Re: hit frequencies (was Re: [Rule Set proposal] French Rules
Re,

I excluded the last two rules from my masscheck to avoid FPs as these ESPs/X-Mailers are definitely grey, "import rcpt list and blast" sort of ESPs not black for global use.

If you can point me to some more information on how to do that, on-list or off-list, I am interested. I am new to this whole business. In fact I was forced to look at X-Mailer and other strange headers for French spam that was still getting through with no real easy keywords, and these guys often had the good idea to have developed their own "software" and be proud of it.

#counts FR_SPAMISLEGAL 8s/2h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_SPAMISLEGAL_2 5s/2h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_NOTSPAM 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_PAYLESSTAXES 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_REALESTATE_INVEST 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_ONLINEGAMBLING 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_ONLINEMEDS 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_REASON_SUBSCRIBE 1s/1h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_HOWTOUNSUBSCRIBE 7s/16h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08

If these are hit rates with a very minimal daily corpus, don't know if the present ruleset is ready for production unless you have 0 tolerance for any bulk, period

I do subscribe to various mailing lists, and none of them seemed compelled to remind me how to unsubscribe, even less to state the law about spam to me. Even the official government "conseil des ministres" (summary of the daily/weekly/whatever government meeting) does not state the "loi informatique et libertés" anymore (but they do use a company I am getting a lot of spam from). So basically the question is: what makes a spam in French recognizable ? On the other hand I am also worried about the very low hits of most rules. 
If all your 1166 spams are in French, we can throw the whole ruleset to /dev/null (well I'll keep it for me anyway). A++; JG
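To read the "#counts" lines quoted above more easily, the following Python sketch turns each one into a spam hit percentage, a ham hit percentage and a spam/(spam+ham) ratio. It assumes the line layout shown in this message and that the ratio is computed from the two percentages; the regular expression and the function name spam_over_overall are illustrative, not part of mass-check or hit-frequencies.

```python
import re

# Three sample lines in the "#counts" format quoted in this thread.
REPORT = """\
#counts FR_SPAMISLEGAL 8s/2h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_NOTSPAM 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
#counts FR_HOWTOUNSUBSCRIBE 7s/16h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
"""

# Assumed layout: rule name, spam hits, ham hits, then corpus totals.
COUNT_LINE = re.compile(
    r"#counts\s+(?P<rule>\S+)\s+(?P<s>\d+)s/(?P<h>\d+)h of \d+ corpus "
    r"\((?P<ts>\d+)s/(?P<th>\d+)h"
)


def spam_over_overall(line):
    """Return (rule, spam %, ham %, ratio), or None for unparsable lines.

    The ratio is None when the rule hit nothing at all.
    Assumes both corpus totals are non-zero.
    """
    m = COUNT_LINE.match(line)
    if m is None:
        return None
    spam_pct = 100.0 * int(m["s"]) / int(m["ts"])
    ham_pct = 100.0 * int(m["h"]) / int(m["th"])
    total = spam_pct + ham_pct
    ratio = spam_pct / total if total > 0 else None
    return m["rule"], spam_pct, ham_pct, ratio


results = {}
for report_line in REPORT.splitlines():
    parsed = spam_over_overall(report_line)
    if parsed is not None:
        results[parsed[0]] = parsed[1:]
```

On these samples the ratio comes out at roughly 0.90 for FR_SPAMISLEGAL and roughly 0.50 for FR_HOWTOUNSUBSCRIBE, which matches the concern raised above: the unsubscribe rule hits ham about as often as spam in this corpus, and several rules hit nothing.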
seekrules over French spam (was Re: [Rule Set proposal] French Rules
Hi,

You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out the patterns; you can then write rules based on these.

I did so, the results are interesting, though I do not really know where to go from there. If I take the first 50 "best" patterns and strip off the obvious stand-alone words and sure-to-be-false-positive expressions, here is what I get to (sorry for non French speakers, explanation below):

RATIO  SPAM%  HAM%   DATA
1.000  9.375  0.000  /Pour ne plus recevoir /
1.000  6.875  0.000  /6 janvier 1978 relative /
1.000  6.875  0.000  /affiche pas correctement, vous pouvez le visualiser en/
1.000  5.625  0.000  /s données nominatives /
1.000  5.625  0.000  / ce message, cliquez-ici/
1.000  5.625  0.000  / vous désinscrire de /
1.000  5.000  0.000  /Conformément à l/
1.000  5.000  0.000  / plus recevoir d\'informations de notre part/
1.000  5.000  0.000  /un droit d\'accès/
1.000  4.375  0.000  /ment Ã| l\'article 34 de la loi/
1.000  4.375  0.000  /ment à l\'article 34 de la loi /
1.000  3.750  0.000  /ous désinscrire de notre /
1.000  3.750  0.000  /es nominatives vous concernant\. /
1.000  3.750  0.000  / Libertés du 6 /
1.000  3.750  0.000  /es vous concernant\. Pour l\'exercer, /

As you can see, charset encoding makes a mess, and many must be regrouped. Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't read this mail in html, click here). The whole result is available at http://www.saphirtech.fr/spam/seekrules_fr_1.txt

http://taint.org/x/2008/seekrules_run

I also adapted this one (paths of course, but also forced "mbox" format, "detect" spit out zero results), but the result is even less "readable" for me. I am missing the script seekrules/kill_bad_patterns, which I presume removes stand-alone words and such things. Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt John
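One common cause of the kind of mess seen in the pattern list is UTF-8 text that was read as Latin-1, which turns each accented letter into two characters. The Python sketch below shows the usual round-trip repair for that specific case; the function name repair_mojibake is hypothetical, and it is only an assumption that this is what happened to the corpus here.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Text that does not fit this pattern is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Conformément à l'article 34 de la loi"

# Simulate the damage: UTF-8 bytes read back as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

repaired = repair_mojibake(garbled)
untouched = repair_mojibake(original)
```

Running such a repair over both corpora before extracting phrases could let the two "article 34" variants fall into a single pattern, at the cost of an extra preprocessing pass.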
Re: hit frequencies (was Re: [Rule Set proposal] French Rules
Thanks for taking this burden upon yourself. One other thing you should be prepared to do, if you're willing to devote long-term responsibility to these rules, is to provide sa-update-compatible feeds of your dynamic rules. This is another thing that Justin can probably help you with.

I am happy to try to do so, but I am honestly not worried about the feed part: all it boils down to is putting the right file at the right place (be it push or pull, ftp or rsync, whatever). What I am more worried about is regularly testing the rules and, even before that, checking that they are valid. They are "good" on my system with my users, but then they were custom-tailored to be so. JG
Re: hit frequencies (was Re: [Rule Set proposal] French Rules
Re, Looking at the rules, I'm worried about false positives on genuine opt-in advertising. I have a number of users who choose to receive all kinds of advertising blurb, This is one of the reasons why I did not hunt for "click here" and "if you can't see this email in html". Now correct me if I am wrong (ouch, no, not on the head), but isn't this what whitelist_from is for ? I never was able to let the Intel newsletter through (it is in English), it would always be caught by SA. Same went for Microsoft Support genuine answers (ok, don't laugh). so I'll run your rules with very low scores for a while to see what gets hit. You can have a little more information, and exactly this suggestion, by reading http://www.saphirtech.fr/spamassassin.html JG
hit frequencies (was Re: [Rule Set proposal] French Rules
Hi, First of all, thanks to Justin for patiently helping me to install mass-check and pointing me in the right direction. I will try to run the algorithms tonight to see what they come up with. In the meantime, you can find a hit-frequencies report at: http://www.saphirtech.fr/spam/freqs_2008_06_23.txt All rules are prefixed with FR_ and are available in the same directory. I must say I did not double check for stray spam in my mailbox before using it as a ham corpus but it *should* be clean. I'll double check for next run. The spam corpus was 100% French spam, hand-picked over the last week through the "probably-spam" class (default score values 5-15). Any feedback on the results (not enough in corpus, bad rules, good rules, etc.) appreciated. Sincerely, JG
Re: [Rule Set proposal] French Rules
I am still missing samples for two rules: even though I did have hits according to /var/spool/maillog, I did not save them.

I added a sample for the FR_NOTSPAM rule, and I removed the FR_YOURELUCKY rule as I see other forms of the text getting through, so it is not efficient. On the other hand, nearly all these messages are caught with RBL rules, so I might even remove it completely if I can't find an efficient one. John

PS: reminder, rules and samples available at http://www.saphirtech.fr/spam/
Re: [Rule Set proposal] French Rules
Hi,

I was able to access the URL you mentioned, but not all of the files below it. I received: "Forbidden You don't have permission to access /spam/FR_PAYLESSTAXES.txt on this server."

Sorry guys, only the ruleset file (the one I tried, of course) was readable, all the non-empty spam samples had bad permissions. This is fixed. I am still missing samples for two rules: even though I did have hits according to /var/spool/maillog, I did not save them. John
[Rule Set proposal] French Rules
Hi,

This is my first post on this list and first ruleset, so please point me to the right place/documents if I am doing anything wrong. According to a search of this list on markmail.org, there have been few subjects about spam in French and (no disrespect meant) I would agree with the comments I read about the current French Ruleset being inadequate (tried it, did not keep any of it). So I would like to propose a set of French Rules and get your feedback.

You can find both the rules and some sample spam email messages (two of them missing, I have hits in my log files, but deleted them) at the following URL: http://www.saphirtech.fr/spam/

I have been running these for about a month sitewide on three domains, I have not seen any false positives (yet). Sincerely, JG

#
# FRENCH SPECIFIC SPAMASSASSIN RULES.
# USE AND REDISTRIBUTE WITH THIS NOTE AT YOUR OWN RISK AND PLEASURE.
# AUTHOR: John GALLET
# Version: 2008-JUNE-17
# Latest: http://www.saphirtech.fr/
# Status: It Works For Me (tm)
#
# Spam is legal in France !
body FR_SPAMISLEGAL /\b(Conform.+ment|En vertu).{0,5}(article.{0,4}34.{0,4})?la loi\b/i
describe FR_SPAMISLEGAL French: pretends spam is (l)awful.
lang fr describe FR_SPAMISLEGAL Invoque la loi informatique et libertes.
score FR_SPAMISLEGAL 2.5

body FR_SPAMISLEGAL_2 /\bdroit d.acc.+s.{1,3}(de modification)?.{0,5}de rectification\b/i
describe FR_SPAMISLEGAL_2 French: pretends spam is (l)awful.
lang fr describe FR_SPAMISLEGAL_2 Invoque le droit de rectification cnil.
score FR_SPAMISLEGAL_2 2.5
#
# yeah, sure.
body FR_NOTSPAM /\b(ceci|ce).{1,9} n.est pas.{1,5}spam\b/i
describe FR_NOTSPAM French: claims not to be spam.
lang fr describe FR_NOTSPAM Affirme ne pas etre du spam.
score FR_NOTSPAM 4.0
#
## I can pay my taxes
body FR_PAYLESSTAXES /\b(paye|calcul|simul|r.+dui|investi).{1,7}(moins|vo|ses).{0,5}imp.+t(s)?\b/i
describe FR_PAYLESSTAXES French: Pay less taxes
lang fr describe FR_PAYLESSTAXES Simulateurs et reductions d'impots.
score FR_PAYLESSTAXES 2.0

body FR_REALESTATE_INVEST /\b(loi)? (de.robien|girardin).{1,15}(neuf|recentr.+|ancien|IR|IS|imp.+t(s)?|industriel(le)?)\b/i
describe FR_REALESTATE_INVEST French: Invest in real-estate with tax-reductions
lang fr describe FR_REALESTATE_INVEST Reduction impots immobilier.
score FR_REALESTATE_INVEST 2.5
#
# I won at the casino
body FR_ONLINEGAMBLING /\b(casino(s)?|jeu(x)?|joueur(s)?) (en ligne|de grattage)\b/i
describe FR_ONLINEGAMBLING French: Online gambling
lang fr describe FR_ONLINEGAMBLING Jeux en ligne.
score FR_ONLINEGAMBLING 2.0
#
# I am so lucky to receive spam
body FR_YOURELUCKY /\b(tentez)? votre (jour de)? chance\b/i
describe FR_YOURELUCKY French: it's your lucky day (sure).
lang fr describe FR_YOURELUCKY Jeux de hasard et de chance.
score FR_YOURELUCKY 1.0
#
# Baby, did you forget to take your meds ?
body FR_ONLINEMEDS /\bpharmacie(s)? (en ligne|internet)\b/i
describe FR_ONLINEMEDS French: Online meds ordering
lang fr describe FR_ONLINEMEDS Achat de medicaments en ligne.
score FR_ONLINEMEDS 3.0
##
# Tell me why
body FR_REASON_SUBSCRIBE /\bVous recevez ce(t|tte)? (message|mail|m.+l|lettre|news.+) (car|parce que)\b/i
describe FR_REASON_SUBSCRIBE French: you subscribed to my spam.
lang fr describe FR_REASON_SUBSCRIBE Indique pourquoi vous recevez le courrier.
score FR_REASON_SUBSCRIBE 1.5
#
# How to unsubscribe
body FR_HOWTOUNSUBSCRIBE /\b(souhaitez|d.+sirez|pour).{1,10}(plus.{1,}recevoir|d.+sinscrire|d.+sinscription).{0,10}(information|email|mail|mailing|newsletter|message|offre|promotion)(s)?\b/i
describe FR_HOWTOUNSUBSCRIBE French: how to unsubscribe
lang fr describe FR_HOWTOUNSUBSCRIBE Indique comment se desabonner.
score FR_HOWTOUNSUBSCRIBE 2.0

# Various "CRM" (Could Remove Me)
#
header FR_MAILER_1 X-Mailer =~ /(delosmail|cabestan|ems|mp6|wamailer|phpmailer|eMailink|Accucast|Benchmail)/i
describe FR_MAILER_1 French spammy X-Mailer
lang fr describe FR_MAILER_1 X-Mailer couramment employe pour des spams en francais.
score FR_MAILER_1 4.0

header FR_MAILER_2 X-EMV- =~ /.+/
describe FR_MAILER_2