Re: xxxl spam
On Apr 14, 2006, at 12:40 AM, Michael Monnerie wrote: On Freitag, 14. April 2006 06:32 Paul R. Ganci wrote: Start young when it is easy for kids to pick up the sounds. Yes, my daughter has the advantage of learning german with me, french with my wife, and later at school she will learn english anyway. Still, people in Belgium have it more easy: in addition to en,de,fr, they learn dutch and their local flavor, a mix of all languages (which dutch is already anyways). The most funny party concerning languages I had was on Crete (and island of Greece): It was a party where all the tourist guides were, about 20 people and at least 9 different languages, where each could speak at least 2, often 4... now that's a mess :-) My favorite story isn't that extreme. It's about a friend of mine who went and did his senior year of HS in study abroad. He had learned German in HS, but was sent to Denmark (close) and spent that year learning the language. When he came back, there was this big party thing in Washington DC for all of the exchange students going in both directions. He came back to the US not having spoken any English for a year, and was put in a hotel room with someone who had been in Germany not speaking English for a year, and a German who had been speaking only English for a year. So, none of them was entirely comfortable going back to speaking their native language yet, none of them had been speaking the same language as the other two during that year ... and they stayed up all night talking. At first, each just spoke the language they had been speaking for the year, and the other two just understood. I think Daniel said that by morning, he was speaking English again :-}
Re: xxxl spam
On Freitag, 14. April 2006 06:32 Paul R. Ganci wrote: > Start young when it is easy for kids to pick up the sounds. Yes, my daughter has the advantage of learning german with me, french with my wife, and later at school she will learn english anyway. Still, people in Belgium have it more easy: in addition to en,de,fr, they learn dutch and their local flavor, a mix of all languages (which dutch is already anyways). The most funny party concerning languages I had was on Crete (and island of Greece): It was a party where all the tourist guides were, about 20 people and at least 9 different languages, where each could speak at least 2, often 4... now that's a mess :-) mfg zmi -- // Michael Monnerie, Ing.BSc- http://it-management.at // Tel: 0660/4156531 .network.your.ideas. // PGP Key: "lynx -source http://zmi.at/zmi3.asc | gpg --import" // Fingerprint: 44A3 C1EC B71E C71A B4C2 9AA6 C818 847C 55CB A4EE // Keyserver: www.keyserver.net Key-ID: 0x55CBA4EE pgpfa5aGhC2IS.pgp Description: PGP signature
Re: xxxl spam
mouss wrote:
>> and I've got plenty of users that speak
>> multiple languages, not all of which use plain-ascii.
>
> I guess so. now I'm not sure our situation isn't worst because people
> tried to find non standard solutions that are still used. I still
> remember the days when some customers were asking us to "fix" our
> software because "it broke their accents"... hopefully these times are
> gone, but I still see "broken" mail (much more than I should). actually,
> I also see mail that doesn't get rendered correctly on thunderbird. so
> I'll admit that the issue isn't really about accented chars...

This is a real sore point for me. I worked on the MIME quoted-printable encoding 14 years ago, and in some ways we haven't come nearly as far as we should have (see my posts as [EMAIL PROTECTED] when I was at France Telecom).

A lot of it has to do with idiots like Microsoft pushing competing standards (like Windows-1251) that offer no advantage whatsoever over their established standards (like ISO Latin-1) and serve only to increase the exponential problem of interoperability matrices... the number of ways each agent must be tested against other agents, etc... thereby guaranteeing that complete testing of all possible permutations becomes an unattainable goal receding ever more quickly towards the horizon. Where we could have been smart and limited ourselves to a manageable and very finite set of permutations instead...

This is why our site has the following rule:

    # don't allow windows-125x text attachments...
    mimeheader __CTYPE_MH_WIN1252  Content-Type =~ /charset=\"windows-125[0-8]\"/i
    meta       L_WIN_CHARSET       ((__CTYPE_MH_HTML || __CTYPE_MH_TEXT_PLAIN) && __CTYPE_MH_WIN1252)
    describe   L_WIN_CHARSET       Content-Type is Windows-specific text
    score      L_WIN_CHARSET       0.1

Should probably do the same for non-MIME content, but it's not as much of a problem since Outlook prefers MIME content.

If anyone wants to talk to us, they can stick with ISO Latin-1. We don't need no stinkin' Windows-125x... (or -839 for that matter).

-Philip
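For anyone who wants to see what the pattern in that rule does and does not catch, here is a rough Python sketch. `STRICT` mirrors the regular expression as posted; `LOOSE` and the function name `is_windows_charset` are my own assumptions for illustration and are not part of the posted rule. The point it shows is that the posted pattern only matches when the charset value is quoted.

```python
import re

# Same pattern as in the posted rule: the charset value must be in double
# quotes, and matching is case-insensitive.
STRICT = re.compile(r'charset="windows-125[0-8]"', re.IGNORECASE)

# Hypothetical looser variant (an assumption, not from the posted rule):
# the double quotes around the charset value are optional.
LOOSE = re.compile(r'charset="?windows-125[0-8]"?', re.IGNORECASE)


def is_windows_charset(content_type, pattern=STRICT):
    """Return True if the Content-Type header value matches the pattern."""
    return pattern.search(content_type) is not None
```

A header such as `text/plain; charset=windows-1252` (no quotes) slips past the strict pattern but is caught by the looser one, so whether the extra leniency is wanted depends on the mail actually seen at a given site.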
Re: xxxl spam
Loren Wilton wrote:
> I predict that the US will be the first country in the 21st century to
> abandon English as the national language, while almost all other countries
> seem to be mandating that their citizens learn English.
>
> Loren

The problem with the US is that we are linguistic idiots (a quote from a Columbia University German professor). If you go to Europe, in general they speak at least two languages fluently: English and the country's native language. I have had the opportunity to work in both Geneva, Switzerland and Milan, Italy. All business is conducted in English and everything else in Italian or, in the case of Switzerland, either German, Swiss German or French. Essentially all the engineers with whom I worked could speak two languages, or in some cases four.

I don't know what the big deal is. It shouldn't be "one" language but at least two here in the US. Start young when it is easy for kids to pick up the sounds. Unfortunately I am still a linguistic idiot and only speak English ... a Buffalo, NY version at that! My grandparents came over from Italy in 1920 and promptly stopped speaking Italian around my parents. It forced my parents to learn English at the cost of never learning Italian. There is plenty of room to accommodate two languages, but neither the US education system nor home life is set up to do it.

--
Paul ([EMAIL PROTECTED])
Re: xxxl spam
> states like California where it could matter (reducing costs in govt
> overhead by eliminating multiple languages and the requirement for
> multilingual workers), the "English as state language" supporters are
> afraid of what almost happened in Florida.

Considering that at last census a "minority" of 54% of California residents spoke Spanish as their primary or only language...

I predict that the US will be the first country in the 21st century to abandon English as the national language, while almost all other countries seem to be mandating that their citizens learn English.

Loren
Re: xxxl spam
On Apr 13, 2006, at 11:40 AM, mouss wrote: Matt Kettler wrote: And even us US folks do have encoding issues. After all, English is not our official language here in the US, what do you mean here? what would be your official language? The US doesn't have an official language. By default, it is assumed to be English for most things, but it's not "Official". And, in some regions within the US, official govt signs and documents come in various languages (the reasons why this is true has to do with liability and legality; since there's no official language, you can't just pick _one_ language to publish your forms in, and be done with it; if you do, you're neglecting significant minority populations (and in some regions, those can be quite significant, such as spanish speakers in southern Florida or southern California), which then makes you vulnerable to law suits saying that you're discriminating and/or being negligent toward those significant minorities who aren't required to speak English, because English isn't an official language). In order to simplify this, some states have tried to enact official language legislation. Florida tried it. Someone put "Make English the official state language" on a ballot. The Cuban-American population in southern Florida got mad, and put "Make Spanish the official state language" on the ballot. Neither one passed, but the Spanish one got more votes. This pretty much silenced the "English as state language" movement in Florida, as their plan almost backfired on them. I don't remember any other state trying it since. The states where there wouldn't be any opposition don't need to make it a law ... and in states like California where it could matter (reducing costs in govt overhead by eliminating multiple languages and the requirement for multilingual workers), the "English as state language" supporters are afraid of what almost happened in Florida. So ... sorry for the long winded explanation, but that's what he was saying.
Re: xxxl spam
mouss wrote: >> However, it is true that the vast majority of the corpus currently >> comes from >> folks who speak English (King's or Yankee) as a primary language, and >> that's a >> bit of a problem as it creates considerable bias in the rules. >> >> And even us US folks do have encoding issues. After all, English is >> not our >> official language here in the US, > > what do you mean here? what would be your official language? The United States of America does not have any official language. Americanized English is our common language, but it's not official. This means that our government has to supply forms and materials in many languages for its citizens, because it cannot require that citizens speak English. For example, we have tax forms in French: http://www.irs.gov/pub/irs-access/f2290fr_accessible.pdf Admittedly non-english forms and services are somewhat secondary here, but they are present. > > and I've got plenty of users that speak >> multiple languages, not all of which use plain-ascii. >> > > I guess so. now I'm not sure our situation isn't worst because people > tried to find non standard solutions that are still used. I still > remember the days when some customers were asking us to "fix" our > software because "it broke their accents"... hopefully these times are > gone, but I still see "broken" mail (much more than I should). actually, > I also see mail that doesn't get rendered correctly on thunderbird. so > I'll admit that the issue isn't really about accented chars... > Well, yours is certainly worse, or at least more prevalent, than the problem here in the US, but I would not say it's the worst. Generally speaking the worst case seems to be present in smaller Asian nations, which have really extensive use of non-us characters. At least the French can restrict their text to the same character set as English and still be readable, although awkward due to the screwed up accents. 
Also, smaller Asian nations still to this day have a high prevalence of locally-grown mail clients, many of which are not even remotely RFC compliant, but work well with others in the same locale. They're also much more likely to make use of mixed-language text containing many character sets. Speaking 2 or 3 different languages is fairly common in the smaller countries of the Asian region, just due to necessity for trade with neighboring countries. Another area with this same basic issue would be the middle-east, but the number of completely different character sets is smaller.
Re: xxxl spam
Matt Kettler wrote: mouss wrote: I also understand that US guys may get less encoded subjects, but at least in .fr, we have that all the time (because of our accented letters, and because many companies still use software that predates mime). and if I find a legitimate IP in a dnsbl used by SA, then I just remove that dnsbl. Sounds like we need more non-us based corpus contributors. After all, the SA devs can only work with what they get. Also, bear in mind that SpamAssassin's creator, Justin Mason, isn't based in the US. Last I checked he was in Ireland. Unfortunately this doesn't help with the encoding issue, as they still use ordinary English characters over there for most things. (I don't think Gaelic is very common in email.) So bear in mind that SA isn't just "developed in the US by US citizens for US markets". oh, I never meant that. However, it is true that the vast majority of the corpus currently comes from folks who speak English (King's or Yankee) as a primary language, and that's a bit of a problem as it creates considerable bias in the rules. And even us US folks do have encoding issues. After all, English is not our official language here in the US, what do you mean here? what would be your official language? and I've got plenty of users that speak multiple languages, not all of which use plain-ascii. I guess so. now I'm not sure our situation isn't worst because people tried to find non standard solutions that are still used. I still remember the days when some customers were asking us to "fix" our software because "it broke their accents"... hopefully these times are gone, but I still see "broken" mail (much more than I should). actually, I also see mail that doesn't get rendered correctly on thunderbird. so I'll admit that the issue isn't really about accented chars...
Re: xxxl spam
mouss wrote: > I also understand that US guys may get less encoded subjects, but at least in > .fr, we have that all the time (because of our accented letters, and because > many companies still use software that predates mime). and if I find a > legitimate IP in a dnsbl used by SA, then I just remove that dnsbl. Sounds like we need more non-us based corpus contributors. After all, the SA devs can only work with what they get. Also, bear in mind that SpamAssassin's creator, Justin Mason, isn't based in the US. Last I checked he was in Ireland. Unfortunately this doesn't help with the encoding issue, as they still use ordinary English characters over there for most things. (I don't think Gaelic is very common in email.) So bear in mind that SA isn't just "developed in the US by US citizens for US markets". However, it is true that the vast majority of the corpus currently comes from folks who speak English (King's or Yankee) as a primary language, and that's a bit of a problem as it creates considerable bias in the rules. And even us US folks do have encoding issues. After all, English is not our official language here in the US, and I've got plenty of users that speak multiple languages, not all of which use plain-ascii.
Re: xxxl spam
John Rudd wrote: I wouldn't do that. Please note that I "said it the short" way. I of course don't jump to disable rules. I do check whether the message should have been flagged as spam (a "reasonable" FP). if so, that's life. If possible, I see if I can create a rule to make it get hammed without breaking the whole filter. If however, the tests that made it classify as spam are not clear to me, then I check if I can lower some. but some tests just get disabled. Just because legitimate mail triggers some rule doesn't mean that the rule is flawed. Using your example, triggering "no_real_name" does not mean that the message is spam, it means that the message has _some_ similarity to at least some spam messages (the higher the score, the stronger the similarity). And, that's absolutely true: statistically, when looking at the corpus which was used to create the rules database, a higher percentage of "no_real_name" messages were spam. As I already said in another thread, the statistics results depend on the attributes you are checking. the perceptron will not wake up and say "hey, come on, this attribute is not good". so, if you run a mass check with rules like: - IP parity - first letter of sender - mailer: "the bat" for instance - relay = comcast, free.fr, ... ... then the perceptron will give you what you asked for: scores. I also understand that US guys may get less encoded subjects, but at least in .fr, we have that all the time (because of our accented letters, and because many companies still use software that predates mime). and if I find a legitimate IP in a dnsbl used by SA, then I just remove that dnsbl. Now, if legit messages were not just triggering those rules, but also triggering enough rules to be flagged as spam ... then I would lower the value of those rules, but not disable those rules. I disable the rules, and if I get false negatives, I see what I can do. up so far, (the very few) missed spam would have been missed anyway. 
But I would only do that if I could see that there was a large percentage of should-be-ham messages being flagged as spam by that rule AND that rule wasn't being useful in flagging spam messages. The reason is: if the message is being flagged, but it shouldn't have been, then perhaps my "corpus" of messages differs significantly enough from the SA internal corpus that my score values need to be different. But that doesn't mean that the rules are so disjoint from tracking spam that they should be entirely disabled. They just don't have the same weighting that my corpus needs. If, instead, most messages passing through my mail servers, that triggered that rule, really did seem to be spam, then I wouldn't alter the score at all. I would just pass the should-have-been-ham message into my bayesian learner and hope that a low bayes score for messages like that would offset the rules had flagged it as spam. everybody has its own situation. I am very FP sensitive. I prefer to get spam than to lose an important mail. after all, I do review my spam. so the less FPs there are, the faster I can review my junk folder.
Re: xxxl spam
On Apr 13, 2006, at 9:56 AM, mouss wrote:
> I am also seeing many legit mail triggering some SA rules (*_exess,
> no_real_name, x_library, ...). when I see this, I check the rule, and if I
> can't find a justification, I disable it.

I wouldn't do that.

Just because legitimate mail triggers some rule doesn't mean that the rule is flawed. Using your example, triggering "no_real_name" does not mean that the message is spam, it means that the message has _some_ similarity to at least some spam messages (the higher the score, the stronger the similarity). And, that's absolutely true: statistically, when looking at the corpus which was used to create the rules database, a higher percentage of "no_real_name" messages were spam.

Now, if legit messages were not just triggering those rules, but also triggering enough rules to be flagged as spam ... then I would lower the value of those rules, but not disable those rules. But I would only do that if I could see that there was a large percentage of should-be-ham messages being flagged as spam by that rule AND that rule wasn't being useful in flagging spam messages.

The reason is: if the message is being flagged, but it shouldn't have been, then perhaps my "corpus" of messages differs significantly enough from the SA internal corpus that my score values need to be different. But that doesn't mean that the rules are so disjoint from tracking spam that they should be entirely disabled. They just don't have the same weighting that my corpus needs.

If, instead, most messages passing through my mail servers that triggered that rule really did seem to be spam, then I wouldn't alter the score at all. I would just pass the should-have-been-ham message into my bayesian learner and hope that a low bayes score for messages like that would offset the rules that had flagged it as spam.
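To make the "lower the score rather than disable the rule" point concrete, here is a small Python sketch of additive scoring. It relies only on the fact that SpamAssassin adds up the scores of the rules that hit and compares the sum with a required score (5.0 by default). The rule names `RULE_A` and `RULE_B` and all score values are made up for illustration; they are not real SA scores.

```python
REQUIRED_SCORE = 5.0  # SpamAssassin's default required score

# Hypothetical scores, for illustration only (not actual SA values).
DEFAULT_SCORES = {"NO_REAL_NAME": 1.0, "RULE_A": 2.5, "RULE_B": 2.0}


def total_score(hits, scores):
    """Sum the scores of every rule that fired on a message."""
    return sum(scores[name] for name in hits)


def is_spam(hits, scores, required=REQUIRED_SCORE):
    """A message is flagged when the summed score reaches the threshold."""
    return total_score(hits, scores) >= required


def with_lowered_score(scores, rule, new_score):
    """Return a copy of the score table with one rule's weight reduced."""
    adjusted = dict(scores)
    adjusted[rule] = new_score
    return adjusted
```

With these made-up numbers, a message hitting only `NO_REAL_NAME` is nowhere near the threshold, while a message hitting all three rules crosses it. Lowering `NO_REAL_NAME` to 0.2 keeps the rule contributing a little evidence without letting it tip that borderline message over.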
Re: xxxl spam
John Rudd wrote: While I don't disagree with your assessment of XP systems, I have a different hunch about why such a large percentage of the mail coming from XP systems is spam, and a smaller percentage of mail coming from the other systems is spam: a) In general, XP systems are not servers, and therefore, are not mail servers. b) Due to (a), if you do your mail/spam/virus scanning on machines that do not receive direct connections from your own clients (mail/spam/virus scanning at the border), OR if you do not have a high percentage of XP clients in your domain, then your scanning systems will not receive many (if any) legitimate direct connections from XP clients ... because a legitimate mail sending process on an XP system will be directly connecting to their own domain's mail server, and not to YOUR mail scanning systems. c) Thus, if you meed the conditions in (b), and if we accept (a) as true, then the vast majority of connections you receive from XP systems, on your mail scanning systems, will be from spam/virus bots trying to directly submit spam or virus laden messages to your mail gateways instead of submitting it to their own mail servers (as bots are known to do). We would expect to see a lower percentage of spam from server type OSes (or OSes that can be clients or servers) because a higher percentage of those platforms are used as legitimate mail servers. The other factor here is: while I _hate_ linux, how much of the spam being submitted by linux boxes is merely a mail server relaying on behalf of one of their infected clients? (same with the unix systems, and the 2000/2003 systems) And thus not at all indicative of the quality of linux systems administration out on the internet. I think this is one of those cases where "the statistics work as blind observations of behavior, but attempting to describe _why_ the statistics works is not something you can sum up with a simple an straight forward explanation". Kinda like QM. 
I agree that statistics aren't the whole story. you can study the percentage of thieves/criminals based on skin color and origin (some people already do it, and many jump to conclusions without studies). but you can do the same study based on social situation and past history of people. the first "researcher" will probably conclude that black/arabic/latin/... people are "more" criminal. the second "researcher" will instead conclude that criminality is more seen in poor communities, but that these aren't the worst criminals (killing vs stealing for instance).

back to xp and co. my feeling (no, I didn't run a study and won't) is that even if any study would show that we get more spam from XP than from linux, I will not use this to classify my mail. I am certain that if you do stats on mail date, you'll find that some dates correspond to more spam than others. we've already seen people jumping to block specific mailers (the bat for instance) based on their stats.

I am also seeing many legit mail triggering some SA rules (*_exess, no_real_name, x_library, ...). when I see this, I check the rule, and if I can't find a justification, I disable it.
Re: xxxl spam
On Apr 13, 2006, at 12:12 AM, Loren Wilton wrote:
> I'd like to venture the suggestion that the percentage of spam from XP
> isn't necessarily an indication of inherent buggyness. It is more an
> indication that it is an OS for Clueless Noobs who haven't a clue about
> maintaining a system, avoiding a virus, or even able to tell if they have a
> virus. These are the machines that turn into zombies.

While I don't disagree with your assessment of XP systems, I have a different hunch about why such a large percentage of the mail coming from XP systems is spam, and a smaller percentage of mail coming from the other systems is spam:

a) In general, XP systems are not servers, and therefore, are not mail servers.

b) Due to (a), if you do your mail/spam/virus scanning on machines that do not receive direct connections from your own clients (mail/spam/virus scanning at the border), OR if you do not have a high percentage of XP clients in your domain, then your scanning systems will not receive many (if any) legitimate direct connections from XP clients ... because a legitimate mail sending process on an XP system will be directly connecting to their own domain's mail server, and not to YOUR mail scanning systems.

c) Thus, if you meet the conditions in (b), and if we accept (a) as true, then the vast majority of connections you receive from XP systems, on your mail scanning systems, will be from spam/virus bots trying to directly submit spam or virus laden messages to your mail gateways instead of submitting it to their own mail servers (as bots are known to do).

We would expect to see a lower percentage of spam from server type OSes (or OSes that can be clients or servers) because a higher percentage of those platforms are used as legitimate mail servers.

The other factor here is: while I _hate_ linux, how much of the spam being submitted by linux boxes is merely a mail server relaying on behalf of one of their infected clients? (same with the unix systems, and the 2000/2003 systems) And thus not at all indicative of the quality of linux systems administration out on the internet.

I think this is one of those cases where "the statistics work as blind observations of behavior, but attempting to describe _why_ the statistics work is not something you can sum up with a simple and straightforward explanation". Kinda like QM.
Re: xxxl spam
Mark Martinec wrote:
> I guess Windows Server 2003 is reported as Windows 2000, but I don't know.
> Certainly a couple of very large sites are seen as Windows 2000.
> In the UNKNOWN category there must be a mix of Windows and Unix hosts,
> not sure what is unusual about them.
>
> Mark

Hmm... FWIW:

[EMAIL PROTECTED] dos]$ sudo p0f -i eth1
p0f - passive os fingerprinting utility, version 2.0.4
(C) M. Zalewski <[EMAIL PROTECTED]>, W. Stearns <[EMAIL PROTECTED]>
p0f: listening (SYN) on 'eth1', 223 sigs (12 generic), rule: 'all'.
24.141.168.241:4218 - Windows XP Pro SP1, 2000 SP3
  -> 66.98.221.156:25 (distance 1, link: ethernet/modem)
66.98.221.156:2602 - Windows 2000 SP4, XP SP1
  -> 24.141.168.241:783 (distance 19, link: ethernet/modem)

24.141.168.241 is Windows XP Pro SP1
66.98.221.156 is Windows Server 2003 SP1 (Standard Edition)

Daryl
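In case it helps anyone who wants to tabulate reports like the two above, here is a rough Python sketch that pulls the source address, OS guess, destination and distance out of one report. The regular expression is derived only from the two sample reports in this post, so treat the format as an assumption; other p0f versions or options may print something different. The names `P0F_REPORT` and `parse_p0f_report` are made up for this sketch.

```python
import re

# Pattern inferred from the two sample reports only (an assumption about
# the general format). Whitespace, including a line break, is allowed
# between the OS guess and the "->" part.
P0F_REPORT = re.compile(
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+):(?P<src_port>\d+)\s+-\s+"
    r"(?P<os>.+?)\s+->\s+"
    r"(?P<dst_ip>\d+\.\d+\.\d+\.\d+):(?P<dst_port>\d+)\s+"
    r"\(distance (?P<distance>\d+), link: (?P<link>[^)]+)\)",
    re.DOTALL,
)


def parse_p0f_report(text):
    """Return a dict describing one p0f report, or None if it does not match."""
    match = P0F_REPORT.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted.
    for key in ("src_port", "dst_port", "distance"):
        fields[key] = int(fields[key])
    return fields
```

Note that the OS field comes back exactly as p0f guessed it ("Windows 2000 SP4, XP SP1" for a host that is really Server 2003), which is the ambiguity being discussed in this thread.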
Re: xxxl spam
Wolfgang, Loren,

> > real mail servers (those that deliver the ham part of mail) rarely ever
> > run XP but that this OS is the best candidate for creating a spam zombie
> Not completely unreasonable. XP is targeted within MS as a personal or
> very small company OS. The equivalent of a linux/unix system used by more
> than a single person would typically be some version of Server 2003. Which
> was probably identified in the stats as Windows 2000.
>
> I'd like to venture the suggestion that the percentage of spam from XP
> isn't necessarily an indication of inherent buggyness. It is more an
> indication that it is an OS for Clueless Noobs who haven't a clue about
> maintaining a system, avoiding a virus, or even able to tell if they have a
> virus. These are the machines that turn into zombies.

I fully agree. In this view the following two lines should be seen as well:

  p0f OS guess      ham : spam
  Linux          58.8 % : 41.2 %
  Unix           80.3 % : 19.7 %

Linux is used by masses (compared to other Unix OS types) because it is considered to be easier to set up. Eventually this also means that less care is invested in prevention of being used to propagate spam. Still, a "score L_P0F_Unix -1.0" seems to be doing a good job here.

Daryl,

> I'm not sure the ham hit rate from the Windows-XP category scales (to
> other installations) very well. The last time I looked into using p0f
> to fingerprint connecting hosts, last spring, I seem to recall that
> Windows XP and Windows 2003 share the same TCP/IP stack and fingerprint
> identically.
>
> While it'd be nice to score "Windows-XP" hosts harshly, there's a lot
> of mail coming from Windows Server 2003 hosts that would get hit.

There is indeed a handful of valid small sites classified by p0f as Windows XP from which we do receive regular mail (well, newsletters and such, but still, should be treated mostly as ham). I don't see adding a few score points to them as much different from other (some quite arbitrary) rules - each rule tries to have a low FP rate, but it often is not zero. Only a collection of all rules has merit.

> I know for some of my systems 1:99 would be really low if Windows Server
> 2003 and XP are identified the same. 40:60 (and in some cases 80:20)
> would be closer to what I often see if I were to assume that all spam
> came from Windows XP hosts.
> Maybe you don't receive much, if any, mail from Windows Server 2003 hosts?

I guess Windows Server 2003 is reported as Windows 2000, but I don't know. Certainly a couple of very large sites are seen as Windows 2000. In the UNKNOWN category there must be a mix of Windows and Unix hosts, not sure what is unusual about them.

Mark
Re: xxxl spam
> to read this in other words: while certain analysts (and definitely microsoft marketing)
> claim that about 50 % of all servers is running windows, these figures tend to say that
> real mail servers (those that deliver the ham part of mail) rarely ever run XP
> but that this OS is the best candidate for creating a spam zombie

Not completely unreasonable. XP is targeted within MS as a personal or very small company OS. The equivalent of a linux/unix system used by more than a single person would typically be some version of Server 2003. Which was probably identified in the stats as Windows 2000.

I'd like to venture the suggestion that the percentage of spam from XP isn't necessarily an indication of inherent buggyness. It is more an indication that it is an OS for Clueless Noobs who haven't a clue about maintaining a system, avoiding a virus, or even able to tell if they have a virus. These are the machines that turn into zombies.

If there were as many linux machines in the hands of Clueless Noobs, I'd bet that the number of infected linux systems would be in the similar percentage range. Remember, these XP systems are virtually all run with Administrator (aka root) privs all the time, by people that haven't a clue what that means. What would happen if all linux-like systems ran that way?

Loren
Re: xxxl spam
Mark Martinec wrote:
> The most interesting part in my view is not the IP distance, but the
> type of OS, illustrated by the following table (derived from the
> same data as fig2):
>
>   p0f OS guess      ham : spam
>   -----------------------------
>   Windows-XP      0.7 % : 99.3 %
>   Windows-2000    5.8 % : 94.2 %
>   UNKNOWN        16.5 % : 83.5 %
>   Linux          58.8 % : 41.2 %
>   Unix           80.3 % : 19.7 %
>   (Unix+Linux    66.5 % : 33.5 %)
>
> Only 0.7% of all mail coming from Windows-XP hosts is ham!!!
> It is ideal information for contributing two or three score points.

I'm not sure the ham hit rate from the Windows-XP category scales (to other installations) very well. The last time I looked into using p0f to fingerprint connecting hosts, last spring, I seem to recall that Windows XP and Windows 2003 share the same TCP/IP stack and fingerprint identically.

While it'd be nice to score "Windows-XP" hosts harshly, there's a lot of mail coming from Windows Server 2003 hosts that would get hit.

I know for some of my systems 1:99 would be really low if Windows Server 2003 and XP are identified the same. 40:60 (and in some cases 80:20) would be closer to what I often see if I were to assume that all spam came from Windows XP hosts.

Maybe you don't receive much, if any, mail from Windows Server 2003 hosts?

Daryl
Re: xxxl spam
Hi,

to read this in other words: while certain analysts (and definitely microsoft marketing) claim that about 50 % of all servers is running windows, these figures tend to say that real mail servers (those that deliver the ham part of mail) rarely ever run XP but that this OS is the best candidate for creating a spam zombie

Wolfgang Hamann

>   p0f OS guess      ham : spam
>   -----------------------------
>   Windows-XP      0.7 % : 99.3 %
>   Windows-2000    5.8 % : 94.2 %
>   UNKNOWN        16.5 % : 83.5 %
>   Linux          58.8 % : 41.2 %
>   Unix           80.3 % : 19.7 %
>   (Unix+Linux    66.5 % : 33.5 %)
>
> Only 0.7% of all mail coming from Windows-XP hosts is ham!!!
> It is ideal information for contributing two or three score points.
Re: xxxl spam
Justin,

> Mark Martinec writes:
> > As a curiosity (but off topic), harvesting results from p0f
> > (passive operating system fingerprinting), here are two more:
> >   http://www.ijs.si/software/amavisd/fig1.gif
> >   Spam score vs. IP distance in hops (our server is
> >   in European academic network Geant)
> > And perhaps most interesting of all (but again OT):
> >   http://www.ijs.si/software/amavisd/fig2.gif
> >   Spam score distribution as a percentage of all mail,
> >   separate by each sending mail client's operating system.
> That's excellent data! Mind if I forward that around to another
> list or two?

I don't mind.

> The "hops" measurement is particularly interesting. Have you got that
> implemented as a working rule, in the field? is it expensive?

Yes, implemented in the field - comes with the latest amavisd-new-2.4.0. It inserts one header field with collected information into the mail header, making it available to SA to score it as it wishes (custom rules, bayes).

It could probably just as well be implemented as a SA plugin (making use of the supplied lightweight p0f-analyzer.pl interface to p0f), but it was easier for me to do it in amavisd-new, where the remote SMTP client's IP address is accessible directly, not needing to parse the header and understand topology.

It is reasonably inexpensive: the cost of running the p0f utility is comparable to running tcpdump, it takes about one hour of CPU per month on our medium-busy mailer, the rest is negligible, no additional latencies and no additional network traffic.

The most interesting part in my view is not the IP distance, but the type of OS, illustrated by the following table (derived from the same data as fig2):

  p0f OS guess      ham : spam
  -----------------------------
  Windows-XP      0.7 % : 99.3 %
  Windows-2000    5.8 % : 94.2 %
  UNKNOWN        16.5 % : 83.5 %
  Linux          58.8 % : 41.2 %
  Unix           80.3 % : 19.7 %
  (Unix+Linux    66.5 % : 33.5 %)

Only 0.7% of all mail coming from Windows-XP hosts is ham!!! It is ideal information for contributing two or three score points.
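One way to compare the rows of that table is to convert each ham : spam pair into spam-to-ham odds. The Python sketch below does only that, using the percentages exactly as listed above. The function names are made up, and the log-odds value is just a convenient way to compare rows; it is not how SpamAssassin derives its scores and is not meant as a suggested score.

```python
import math

# (ham %, spam %) per p0f OS guess, copied from the table above.
P0F_TABLE = {
    "Windows-XP": (0.7, 99.3),
    "Windows-2000": (5.8, 94.2),
    "UNKNOWN": (16.5, 83.5),
    "Linux": (58.8, 41.2),
    "Unix": (80.3, 19.7),
}


def spam_odds(os_guess):
    """Spam-to-ham odds for one OS guess (values above 1 mean mostly spam)."""
    ham_pct, spam_pct = P0F_TABLE[os_guess]
    return spam_pct / ham_pct


def log_odds(os_guess):
    """Natural log of the odds: positive leans spam, negative leans ham."""
    return math.log(spam_odds(os_guess))
```

The sign of `log_odds` separates the Windows and UNKNOWN rows (positive) from Linux and Unix (negative), which matches the idea of adding points for some OS guesses and subtracting them for others.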
Traffic from a site's own PC clients must not be seen by p0f, otherwise one would be penalizing the site's own users. This can be achieved either by separating the MSA from the MTA, or by using a list of internal IP networks for exclusion.

A quick summary from the amavisd-new-2.4.0 release notes:

- experimental support for passive operating system fingerprinting with the use of an externally running utility p0f, supplying collected information as a header field to SpamAssassin, making it possible to add rules to score SMTP client hosts based on an educated guess about their operating system type and IP distance; see below for details;

Here are the installation details:

- passive operating-system fingerprinting (p0f) support lets SA gain information about the SMTP client's operating system and estimated IP distance, and can reduce the number of bounces:

  * find and install the p0f utility: http://lcamtuf.coredump.cx/p0f.shtml or in the FreeBSD ports collection as 'net-mgmt/p0f';

  * start a p0f process on the same host where the MTA (MX) is running, making it listen only to incoming TCP sessions (to reduce its workload) to the IP address and TCP port (25) where the MTA is accepting incoming mail from outside (it doesn't hurt to let it see other traffic too, it just isn't needed); after testing p0f alone and seeing that it works, you may start it up, feeding its output to the program p0f-analyzer.pl that comes with the amavisd-new package, e.g.:

      p0f -l 'tcp dst port 25' 2>&1 | p0f-analyzer.pl 2345 &

    on multi-homed boxes one may need to specify the interface and IP address where the MTA is listening; the filter syntax is the same as in tcpdump, e.g.:

      p0f -l -i bge0 'dst host 192.0.2.66 and tcp dst port 25' 2>&1 \
        | p0f-analyzer.pl 2345 &

  * the program p0f-analyzer.pl reads p0f reports on stdin, keeps a cache for a limited time (10 minutes, configurable) of data about incoming TCP sessions organized by remote IP address, and listens on UDP port 2345 (specified as its command line argument) for queries; only queries from allowed IP addresses are accepted and responded to, other queries are silently ignored - configure @inet_acl accordingly, defaults to 127.0.0.1;

  * adding the following line to amavisd.conf, matching the chosen port number to the one specified on the command line to p0f-analyzer.pl:

      $os_fingerprint_method = 'p0f:127.0.0.1:2345';

    makes amavisd send queries to p0f-analyzer.pl (on the supplied IP address and UDP port number) to collect information about the remote SMTP client's OS; the collected response is then supplied as a header field when SpamAssassin is invoked; query/response is very quick and imposes no burden on the amavisd process, nor does it extend its processing time.

The $os_fingerprint_method setting is also a member of policy banks to make it more flexible to
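
The caching behaviour described above for p0f-analyzer.pl (entries keyed by remote IP address, kept for a limited time, answered only for allowed query sources) can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl code shipped with amavisd-new: the class name `FingerprintCache`, its methods and the `allowed` parameter are hypothetical, and the real program's report format and UDP protocol are not reproduced.

```python
import time


class FingerprintCache:
    """Hypothetical TTL cache of OS guesses keyed by remote IP address."""

    def __init__(self, ttl_seconds=600, allowed=("127.0.0.1",),
                 clock=time.monotonic):
        self.ttl = ttl_seconds        # 10 minutes, as in the release notes
        self.allowed = set(allowed)   # query sources that get an answer
        self.clock = clock            # injectable so expiry can be tested
        self._entries = {}            # remote ip -> (timestamp, os_guess)

    def record(self, remote_ip, os_guess):
        """Store the latest report seen for a remote SMTP client."""
        self._entries[remote_ip] = (self.clock(), os_guess)

    def query(self, asker_ip, remote_ip):
        """Return the cached guess, or None if the entry is unknown or
        expired, or if the asker is not allowed (silently ignored)."""
        if asker_ip not in self.allowed:
            return None
        entry = self._entries.get(remote_ip)
        if entry is None:
            return None
        stamp, os_guess = entry
        if self.clock() - stamp > self.ttl:
            del self._entries[remote_ip]   # drop stale data lazily
            return None
        return os_guess
```

Expiring entries lazily at query time keeps the sketch short; a long-running process would probably also purge old entries periodically so the table does not grow without bound.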
Re: xxxl spam
That's excellent data! Mind if I forward that around to another list or two?

The "hops" measurement is particularly interesting. Have you got that implemented as a working rule, in the field? Is it expensive?

--j.

Mark Martinec writes:
> mouss wrote:
> > since most filters skip large messages, it may be tempting for spammers
> > to send large messages:
>
> I did some statistical analysis a few weeks ago with SA 3.1.1
> (SA called from amavisd-new, but that is beside the point).
>
> Please see:
>
> http://www.ijs.si/software/amavisd/fig4.gif
> Shows spam score vs. mail size as a scattergram
>
> http://www.ijs.si/software/amavisd/fig5.gif
> Shows elapsed time for mail checking vs. mail size
> (shown is total time, but >90% of it reflects processing
> within SA and its plugins)
>
> As a curiosity (but off topic), harvesting results from p0f
> (passive operating system fingerprinting), here are two more:
>
> http://www.ijs.si/software/amavisd/fig1.gif
> Spam score vs. IP distance in hops (our server is
> in European academic network Geant)
>
> And perhaps most interesting of all (but again OT):
>
> http://www.ijs.si/software/amavisd/fig2.gif
> Spam score distribution as a percentage of all mail,
> separate by each sending mail client's operating system.
>
> Mark
Re: xxxl spam
Theo Van Dinter writes:
> On Tue, Apr 11, 2006 at 02:14:26PM -0400, Matt Kettler wrote:
> > Well, SA automatically ignores attachments in recent versions. However,
> > hash-based plugins like razor, dcc, and pyzor work best when seeing all the
> > attachments.
>
> For completeness, the first sentence isn't exactly true.
> SA "automatically ignores attachments" for the standard set of body,
> header, and uri rules, but it still has to read in the data, store it in
> the message tree internally, and make the entire message text available
> for full rules.
>
> There are also things like the AntiVirus plugin, etc, which may go ahead
> and decode attachments and do things with the data. I could easily see
> a plugin for ClamAV, or something scanning image files, etc.
>
> I think that at some point, the default size could go up, but I wouldn't
> try it for now.

Matt Sergeant had a good trick in the qpsmtpd SpamAssassin plugin, iirc: it would download the entire message, but after a certain point (e.g. 250k) it would stop writing the incoming data to memory, and instead flush the remainder to a temporary file on disk. That way it could keep only the first 250k of each message in memory, scan that part, and once complete, reassemble the whole message as it wrote it back out.

However, there may be issues there. E.g. consider a multipart/alternative message containing an innocent-looking 600k text/plain part, followed by a 10k text/html spam payload. Common MUAs would display the latter; SpamAssassin would scan the former. That seems to be a vulnerability to me, although we already don't scan large messages _anyway_ ;)

Also, as Theo said, it fails in the face of any kind of message-body rewriting by SpamAssassin.

--j.
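
The "keep the head in memory, spill the rest to disk" trick described above can be sketched as follows. This is a minimal Python illustration of the idea only, not the qpsmtpd plugin's actual (Perl) code; the function names `receive_with_spill` and `reassemble` and the chunk size are hypothetical.

```python
import tempfile


def receive_with_spill(stream, memory_limit=250 * 1024, chunk_size=8192):
    """Read all of `stream`, keeping the first `memory_limit` bytes in
    memory and writing everything beyond that to a temporary file."""
    head = bytearray()
    spill = tempfile.TemporaryFile()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        room = memory_limit - len(head)
        if room > 0:
            head.extend(chunk[:room])   # fill the in-memory part first
            chunk = chunk[room:]
        if chunk:
            spill.write(chunk)          # the remainder goes to disk
    return bytes(head), spill


def reassemble(head, spill, out, chunk_size=8192):
    """Write the scanned head followed by the spilled tail to `out`."""
    out.write(head)
    spill.seek(0)
    while True:
        chunk = spill.read(chunk_size)
        if not chunk:
            break
        out.write(chunk)
    spill.close()
```

Only `head` would be handed to the scanner. As noted above, this only works if the scanner does not rewrite the message body, since the tail is appended unchanged.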
RE: greetpause was Re: xxxl spam
On Tuesday, April 11, 2006 1:37 PM -0700 [EMAIL PROTECTED] wrote:

> Agreed. Spammers have access to all the free CPU bandwidth and processing
> time they can steal - legitimate MTAs are limited to a budget. Any
> anti-spam solution that simply rewards CPU and bandwidth spent* is
> playing into the hands of the spammers.

The original concern was that spammers would use larger messages to avoid the size cutoff in SA, but this was countered because spammers have to reduce their message rate to send larger messages.

Server-side, GreetPause (and greylisting) forces a client to reduce its message rate. If the client has unlimited bandwidth and doesn't care about the reduced message rate, it might as well shovel giant messages. In for a penny, in for a pound.
RE: greetpause was Re: xxxl spam
mouss wrote:
> so greetpause will certainly stop some ratware spam, but is not a
> "full" solution.

Agreed. Spammers have access to all the free CPU bandwidth and processing time they can steal - legitimate MTAs are limited to a budget. Any anti-spam solution that simply rewards CPU and bandwidth spent* is playing into the hands of the spammers.

* Email stamps, "factor this product of large primes" challenges, greetpause

--
Matthew.van.Eerde (at) hbinc.com 805.964.4554 x902
Hispanic Business Inc./HireDiversity.com
Software Engineer
Re: greetpause was Re: xxxl spam
Mike Jackson wrote:
> > > You can also impose this cost on spammers by enabling the GreetPause
> > > feature in the more recent versions of sendmail. This tells sendmail not
> > > to answer right away when receiving a connection, and to drop the
> > > connection if anything is received before the greeting is sent out. This
> > > punishes "slammer" spammers who push the whole SMTP conversation through
> > > and then disconnect. It also ensures that every connection from an
> > > unknown sender takes a minimum amount of time. You can add exceptions in
> > > your access database for your customers and frequent correspondents. For
> > > example, this exception drops the GreetPause to zero for my LAN (example
> > > is for 10.123/16):
> > >
> > > GreetPause:10.123 0
> >
> > Is this as effective as greylisting?
>
> Perhaps not, but it also doesn't have any of the drawbacks (ie, delayed
> mail, need to whitelist non-behaving servers, etc.). I recently enabled it
> on my servers, and it's been stopping a ton of mail without any complaints
> from legitimate senders.

greetpause only blocks some ratware spam. If I were to write spam and/or viruses, I would just add a sleep(x):

  given N victims, choose M among them:
  for i=0; i[...]

so greetpause will certainly stop some ratware spam, but is not a "full" solution.

Also, if your greetpause requires sleep()-ing on every connection, then it's not acceptable (for me), as this is an invitation to DoS. I am not aware of any async MTA [read: one that will not sleep, but will handle other connections in the meantime], at least in the open source world.

If you are after "miscreants", then partial greylisting is probably more effective (I mean greylisting some of the connections, based on the client name, IP, behaviour, etc.).
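
The concern above is that a greeting pause implemented with a blocking sleep ties up the server for each connection. As an illustration of the asynchronous alternative mentioned in the message, here is a minimal Python sketch using `asyncio`. It is not how sendmail or any particular MTA implements the feature; `greet_pause` and `demo` are hypothetical names, and the clients are simulated with in-memory stream readers rather than real sockets.

```python
import asyncio


async def greet_pause(reader, pause_seconds):
    """Wait `pause_seconds` before greeting. Return True if the client
    stayed silent (send the banner), False if it talked early or hung up."""
    try:
        await asyncio.wait_for(reader.read(1), timeout=pause_seconds)
    except asyncio.TimeoutError:
        return True    # nothing arrived during the pause
    return False       # data (or EOF) arrived before the greeting


async def demo():
    # Two simulated clients: a "slammer" that sends at once, and a polite one.
    slammer, polite = asyncio.StreamReader(), asyncio.StreamReader()
    slammer.feed_data(b"HELO spam.example\r\n")
    # Both pauses run concurrently; neither blocks the event loop.
    return await asyncio.gather(
        greet_pause(slammer, 0.2),
        greet_pause(polite, 0.2),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because each pause is an awaited timeout rather than a sleep in the accepting process, many connections can wait at once while others are served.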
Re: greetpause was Re: xxxl spam
> > You can also impose this cost on spammers by enabling the GreetPause
> > feature in the more recent versions of sendmail. This tells sendmail not
> > to answer right away when receiving a connection, and to drop the
> > connection if anything is received before the greeting is sent out. This
> > punishes "slammer" spammers who push the whole SMTP conversation through
> > and then disconnect. It also ensures that every connection from an
> > unknown sender takes a minimum amount of time. You can add exceptions in
> > your access database for your customers and frequent correspondents. For
> > example, this exception drops the GreetPause to zero for my LAN (example
> > is for 10.123/16):
> >
> > GreetPause:10.123 0
>
> Is this as effective as greylisting?

Perhaps not, but it also doesn't have any of the drawbacks (ie, delayed mail, need to whitelist non-behaving servers, etc.). I recently enabled it on my servers, and it's been stopping a ton of mail without any complaints from legitimate senders.
greetpause was Re: xxxl spam
Kenneth Porter wrote: > You can also impose this cost on spammers by enabling the GreetPause > feature in the more recent versions of sendmail. This tells sendmail not > to answer right away when receiving a connection, and to drop the > connection if anything is received before the greeting is sent out. This > punishes "slammer" spammers who push the whole SMTP conversation through > and then disconnect. It also ensures that every connection from an > unknown sender takes a minimum amount of time. You can add exceptions in > your access database for your customers and frequent correspondents. For > example, this exception drops the GreetPause to zero for my LAN (example > is for 10.123/16): > > GreetPause:10.123 0 Is this as effective as greylisting? -- Mr Michele Neylon Blacknight Solutions Quality Business Hosting & Colocation http://www.blacknight.ie/ Tel. 1850 927 280 Intl. +353 (0) 59 9183072 Direct Dial: +353 (0)59 9183090 Fax. +353 (0) 59 9164239
Re: xxxl spam
On Tuesday, April 11, 2006 2:14 PM -0400 Matt Kettler <[EMAIL PROTECTED]> wrote:

> I've not seen it with dummy text, but I have seen the large image spam.
> However, it's very rare. The problem being that if you're a large-volume
> spammer, large messages take a longer time to send, and thus reduce your
> spams/minute.

You can also impose this cost on spammers by enabling the GreetPause feature in the more recent versions of sendmail. This tells sendmail not to answer right away when receiving a connection, and to drop the connection if anything is received before the greeting is sent out. This punishes "slammer" spammers who push the whole SMTP conversation through and then disconnect. It also ensures that every connection from an unknown sender takes a minimum amount of time.

You can add exceptions in your access database for your customers and frequent correspondents. For example, this exception drops the GreetPause to zero for my LAN (example is for 10.123/16):

  GreetPause:10.123 0
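
To illustrate how a per-network exception such as the one above could be resolved, here is a small Python sketch. It assumes the access map is consulted most-specific prefix first (the full address, then with trailing octets removed); that lookup order, the function name `greet_pause_for`, the dictionary standing in for the access database, and the 5000 ms default are all assumptions made for illustration, not details taken from the message.

```python
def greet_pause_for(client_ip, access, default_ms=5000):
    """Return the pause for `client_ip`, trying the most specific dotted
    prefix first (10.123.4.5, then 10.123.4, then 10.123, then 10)."""
    octets = client_ip.split(".")
    for n in range(len(octets), 0, -1):
        key = "GreetPause:" + ".".join(octets[:n])
        if key in access:
            return access[key]
    return default_ms   # hypothetical site-wide default


# The exception quoted above: no pause for the 10.123/16 LAN.
access = {"GreetPause:10.123": 0}
```

With this table, a LAN client such as 10.123.4.5 gets no pause, while any other address falls through to the default.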
Re: xxxl spam
mouss wrote:
> since most filters skip large messages, it may be tempting for spammers
> to send large messages:

I did some statistical analysis a few weeks ago with SA 3.1.1 (SA called from amavisd-new, but that is beside the point).

Please see:

  http://www.ijs.si/software/amavisd/fig4.gif
  Shows spam score vs. mail size as a scattergram

  http://www.ijs.si/software/amavisd/fig5.gif
  Shows elapsed time for mail checking vs. mail size
  (shown is total time, but >90% of it reflects processing
  within SA and its plugins)

As a curiosity (but off topic), harvesting results from p0f (passive operating system fingerprinting), here are two more:

  http://www.ijs.si/software/amavisd/fig1.gif
  Spam score vs. IP distance in hops (our server is
  in European academic network Geant)

And perhaps most interesting of all (but again OT):

  http://www.ijs.si/software/amavisd/fig2.gif
  Spam score distribution as a percentage of all mail,
  separate by each sending mail client's operating system.

Mark
Re: xxxl spam
Theo Van Dinter wrote:
> On Tue, Apr 11, 2006 at 02:46:41PM -0400, Matt Kettler wrote:
>> Of course, this can't work if you're using any kind of encapsulation
>> options in report_safe, but since MailScanner does all the markup itself,
>> it doesn't hurt it to send Mail::SpamAssassin a truncated version.
>> Converting this to the spamc/spamd model might be kind of difficult due
>> to this, but it's worth considering for spamc -c.
>
> It's been suggested before, but it doesn't quite work for SA
> unfortunately. SA is designed to be a generic mail filter, and some
> rules/plugins/etc expect to be able to see the entire original contents
> of the message, so we can't really trim off pieces. Also, things like
> spamc have no concept of what a message actually is, they just read in
> a bunch of data and send it somewhere, so the full message would have to
> be read in by spamd before anything could be trimmed off of it. At that
> point there's not a lot of savings in trimming off attachments (though the
> raw versions could potentially be stored in temp files instead of memory).
>
> And then, as you said, with encapsulation and such, we'd need the whole of the
> message anyway.

Agreed. The only part of SA that this would be straightforward for would be spamc -c. At that point, spamc isn't piping the message back out, and isn't doing encapsulation, so truncation would be irrelevant.
Re: xxxl spam
On Tue, Apr 11, 2006 at 02:46:41PM -0400, Matt Kettler wrote:
> Of course, this can't work if you're using any kind of encapsulation
> options in report_safe, but since MailScanner does all the markup itself,
> it doesn't hurt it to send Mail::SpamAssassin a truncated version.
> Converting this to the spamc/spamd model might be kind of difficult due
> to this, but it's worth considering for spamc -c.

It's been suggested before, but it doesn't quite work for SA unfortunately. SA is designed to be a generic mail filter, and some rules/plugins/etc expect to be able to see the entire original contents of the message, so we can't really trim off pieces. Also, things like spamc have no concept of what a message actually is, they just read in a bunch of data and send it somewhere, so the full message would have to be read in by spamd before anything could be trimmed off of it. At that point there's not a lot of savings in trimming off attachments (though the raw versions could potentially be stored in temp files instead of memory).

And then, as you said, with encapsulation and such, we'd need the whole of the message anyway.

--
Randomly Generated Tagline:
"NT is secure as long as you don't remove the shrink wrap." - G. Myers
Re: xxxl spam
Theo Van Dinter wrote:
> On Tue, Apr 11, 2006 at 02:14:26PM -0400, Matt Kettler wrote:
>> Well, SA automatically ignores attachments in recent versions. However,
>> hash-based plugins like razor, dcc, and pyzor work best when seeing all the
>> attachments.
>
> For completeness, the first sentence isn't exactly true.
> SA "automatically ignores attachments" for the standard set of body,
> header, and uri rules, but it still has to read in the data, store it in
> the message tree internally, and make the entire message text available
> for full rules.

Fair enough...

> There are also things like the AntiVirus plugin, etc, which may go ahead
> and decode attachments and do things with the data. I could easily see
> a plugin for ClamAV, or something scanning image files, etc.
>
> I think that at some point, the default size could go up, but I wouldn't
> try it for now.

FWIW, it might be worth considering the approach used by MailScanner. MailScanner still scans large messages, but truncates messages over "Max SpamAssassin Size". Presumably it does so in a manner that still leaves correct MIME boundaries, because I don't get any kind of superfluous rule hits regarding MIME boundaries on large messages. I've currently got this set to 60k, but MailScanner defaults to 30k.

Of course, this can't work if you're using any kind of encapsulation options in report_safe, but since MailScanner does all the markup itself, it doesn't hurt it to send Mail::SpamAssassin a truncated version. Converting this to the spamc/spamd model might be kind of difficult due to this, but it's worth considering for spamc -c.
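
The message above only presumes that MailScanner truncates in a way that keeps the MIME structure valid. One way such a truncation could work is sketched below in Python. This is a guess at the general idea, not MailScanner's actual code: `truncate_for_scan` is a hypothetical name, only the top-level boundary is closed (nested multiparts are not handled), and the headers are assumed to fit within the size limit.

```python
import email


def truncate_for_scan(raw, max_size):
    """Cut `raw` (bytes) to at most `max_size` bytes at a line boundary
    and, for a multipart message, append the closing top-level boundary."""
    if len(raw) <= max_size:
        return raw
    boundary = email.message_from_bytes(raw).get_boundary()
    cut = raw.rfind(b"\n", 0, max_size) + 1   # keep only whole lines
    truncated = raw[:cut]
    if boundary is not None:
        # Close the multipart so parsers do not see a missing end marker.
        truncated += b"--" + boundary.encode("ascii") + b"--\n"
    return truncated
```

The truncated copy would be used for scanning only; the original message is delivered untouched, which is why this does not combine with report_safe-style encapsulation.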
Re: xxxl spam
On Tue, Apr 11, 2006 at 02:14:26PM -0400, Matt Kettler wrote:
> Well, SA automatically ignores attachments in recent versions. However,
> hash-based plugins like razor, dcc, and pyzor work best when seeing all the
> attachments.

For completeness, the first sentence isn't exactly true. SA "automatically ignores attachments" for the standard set of body, header, and uri rules, but it still has to read in the data, store it in the message tree internally, and make the entire message text available for full rules.

There are also things like the AntiVirus plugin, etc, which may go ahead and decode attachments and do things with the data. I could easily see a plugin for ClamAV, or something scanning image files, etc.

I think that at some point, the default size could go up, but I wouldn't try it for now.

--
Randomly Generated Tagline:
Zoidberg: That's where I'm meeting Uncle Zoid for lunch to discuss my Hollywood dream. The next time you see me, don't be surprised if I've eaten.
Re: xxxl spam
mouss wrote:
> since most filters skip large messages, it may be tempting for spammers
> to send large messages:
>
> - using a large but "invisible" part (either by using mime and putting a
>   large text part in an alternative mime, or using "invisible" chars
>   before their own text).
>
> - using a large image
>
> - large "tail" (spammers can append anything).
>
> - "unused" attachments
>
> questions:
> - has this already been seen?

I've not seen it with dummy text, but I have seen the large image spam. However, it's very rare. The problem is that if you're a large-volume spammer, large messages take longer to send, and thus reduce your spams/minute.

There's only one spammer that's done this to me. There's some group of stores in Guatemala that sends me high-res scans of their newspaper:

  Consejeros en Finanzas Empresariales - some kind of bank
  La Cuacao - some kind of electronics shop? or an eye doctor?
  cefesa hardware - a True Value hardware store.

Why anyone in Guatemala thinks I'll visit their store to spend "Q. 22" on a patio log fake fire log or "Q. 85" on a generic brand weed and feed fertilizer is beyond me.

But other than these guys, I don't get any spams >250kb.

> - how can we mitigate this?

Personally, I think it is largely self-mitigating. The size of such messages greatly limits their potential distribution. As I see it, there's very little large spam out there.

> my first thought would be to "process" the message before passing it to
> the filter. In particular, are there drawbacks/benefits if I remove
> attachments before passing them to SA (or any other filter)?

Well, SA automatically ignores attachments in recent versions. However, hash-based plugins like razor, dcc, and pyzor work best when seeing all the attachments.
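
The "self-mitigating" argument above (larger messages mean fewer spams per minute) can be made concrete with a little arithmetic. The Python sketch below assumes a sender limited only by uplink bandwidth and ignores SMTP overhead and latency; the 1 Mbit/s uplink and the 5 kB message size are hypothetical figures chosen for illustration, and only the 250 kB size echoes the message.

```python
def spams_per_minute(uplink_bits_per_second, message_bytes):
    """Upper bound on messages sent per minute over a saturated uplink,
    ignoring protocol overhead and round-trip latency."""
    bytes_per_minute = uplink_bits_per_second / 8 * 60
    return bytes_per_minute / message_bytes


# Hypothetical 1 Mbit/s uplink: a 5 kB spam vs. a 250 kB spam.
small = spams_per_minute(1_000_000, 5_000)      # 1500.0 per minute
large = spams_per_minute(1_000_000, 250_000)    # 30.0 per minute
```

Under these assumptions the send rate falls in direct proportion to message size, here by a factor of 50.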