On 2/2/2013 11:01 PM, John Hardin wrote:
On Sat, 2 Feb 2013, Eliezer Croitoru wrote:

Yes I do understand that it's hard.
I worked a bit with perl so I might be able to write something that
will do that if it doesn't exist already.

That's probably what it will take.

I will try to explain even more.
The problem is that I get mail containing an example of the SPAM content,
which didn't come from email, just so that it can be categorized as SPAM.
This is not what SA was built for, but it gives very good
results in general.
This is a specific case.

Ah, I think I see; by "this is a form" you meant your need is for
scanning content submitted via a web form to see if it is spammy?

Yes..

I have an active system which someone wrote in C# that scans the chars
etc., but the problem is that it's in C# and it's an active check that
crawls the site and parses it, rather than a RESTful system that
triggers the checks when needed.

This is an example of the content:
http://www.fpaste.org/yFOC/

It could even be some CMS post that someone received and wants to
categorize as spam.

So that sample message is largely hacked up just to provide headers so
that it looks like an email and SA can scan it? That sure doesn't look
like a valid email and there are a lot of obvious spam signs in the
headers.
This msg was indeed recognized as spam.
I have other msgs which have:
X-Spam-Status: No, score=3.146 tagged_above=2 required=6.2
        tests=[FROM_ILLEGAL_CHARS=2.059, LOTS_OF_MONEY=0.001,
        RCVD_IN_XBL=0.724, RDNS_DYNAMIC=0.363, SPF_PASS=-0.001] autolearn=no

And in this case I have a one way filter that actually works.
Language filtering.

I wrote something in Ruby which actually works fine as a starter.

#code start
spam_content = "the long part from the mail".force_encoding("Windows-1255")

# Number of Hebrew characters counted in the known spam template.
template_hebrew_chars = 270

# Windows-1255 byte ranges that are counted as Hebrew.
HEBREW_BYTE_RANGES = [192..203, 205..219, 223..251]

# True if the single-byte character falls inside one of the ranges above.
def hebrew_char(char)
  byte = char.unpack("H*")[0].hex
  HEBREW_BYTE_RANGES.any? { |range| range.member?(byte) }
end

# Count the Hebrew characters in the content.
counter = 0
spam_content.each_char { |char| counter += 1 if hebrew_char(char) }

if counter == template_hebrew_chars
  puts "this is a spam"
else
  puts "might not be a spam"
end
#code end

There are a couple of directions in the identification tree, like how many words exist.
If there are mixed Hebrew and English words, what to decide...
Identifying URLs, etc.
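
To make those directions a bit more concrete, here is a rough sketch of a per-word profile. It is only an illustration under assumptions: the names classify_token and token_profile are made up for this example, only the Hebrew letters alef..tav (0xE0..0xFA in Windows-1255) are treated as Hebrew, and a URL is anything starting with http:// or https://. It needs Ruby 2.7 or newer because of Enumerable#tally.

```ruby
# Sketch only: the token classes and byte range are illustrative assumptions.
HEBREW_LETTER_BYTES = (0xE0..0xFA) # alef..tav in Windows-1255
URL_PREFIX = %r{\Ahttps?://}i

# Classifies one whitespace-separated token given as raw bytes.
def classify_token(token)
  return :url if token.match?(URL_PREFIX)
  has_hebrew = token.each_byte.any? { |byte| HEBREW_LETTER_BYTES.cover?(byte) }
  has_latin  = token.match?(/[A-Za-z]/)
  return :mixed   if has_hebrew && has_latin
  return :hebrew  if has_hebrew
  return :english if has_latin
  :other
end

# Returns a hash of token class => number of tokens in that class.
# Windows-1255 is a single-byte encoding, so working on raw bytes is enough.
def token_profile(text)
  text.b.split.map { |token| classify_token(token) }.tally
end

# Hypothetical sample: a Hebrew word, two English words, one mixed token and a URL.
sample = "\xF9\xEC\xE5\xED hello buy\xF9\xEC now http://example.com/x".b
p token_profile(sample)
```

What to decide from the profile (for example, how many mixed tokens are acceptable) is left open here, since that is exactly the undecided part.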

I have used:
http://msdn.microsoft.com/en-US/goglobal/cc305148
http://en.wikipedia.org/wiki/Windows-1255#Code_page_layout
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1255.txt

And maybe later I will try to write something in Perl that can help with that.
The mixing of two languages makes it a bit of a problem, and I had a nice algorithm in mind to decide on a percentage for the Hebrew language in this encoding.
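
A minimal sketch of what such a percentage check could look like, not the exact algorithm: it counts Hebrew letters against Latin letters in Windows-1255 bytes and ignores everything else. The name hebrew_percentage and the choice to compare only against A-Z/a-z are assumptions made for the example.

```ruby
# Sketch only: the byte ranges below are assumptions for illustration.
HEBREW_LETTERS = (0xE0..0xFA)                 # alef..tav in Windows-1255
LATIN_LETTERS  = [(0x41..0x5A), (0x61..0x7A)] # A-Z and a-z

# Returns the percentage (0.0..100.0) of letter bytes that are Hebrew letters.
# Digits, spaces and punctuation are not counted at all.
def hebrew_percentage(text)
  hebrew = 0
  latin = 0
  text.each_byte do |byte|
    if HEBREW_LETTERS.cover?(byte)
      hebrew += 1
    elsif LATIN_LETTERS.any? { |range| range.cover?(byte) }
      latin += 1
    end
  end
  total = hebrew + latin
  return 0.0 if total.zero?
  100.0 * hebrew / total
end

# Hypothetical sample: four Hebrew letters followed by "abc".
sample = "\xF9\xEC\xE5\xED abc".b.force_encoding("Windows-1255")
puts hebrew_percentage(sample).round(1)
```

A cut-off on that percentage would still have to be picked from real samples.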

Other encodings such as UTF-8, or even more complex phonetic languages, make it a bit more difficult, but since most simple mails consist of plain text it won't be such a big problem.
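
For UTF-8 the byte ranges no longer apply, because a Hebrew letter takes more than one byte. One possible approach, sketched here under the assumption that the text is UTF-8, is to lean on Ruby's Unicode regex properties instead. The name hebrew_share_utf8 is made up for this example, and String#scrub needs Ruby 2.1 or newer.

```ruby
# Sketch only: assumes the input is meant to be UTF-8 text.
# Returns the percentage (0.0..100.0) of letters that belong to the Hebrew script.
def hebrew_share_utf8(text)
  # Drop invalid byte sequences so the regex scan cannot raise on them.
  clean = text.dup.force_encoding("UTF-8").scrub("")
  letters = clean.scan(/\p{L}/)
  return 0.0 if letters.empty?
  hebrew = letters.count { |char| char.match?(/\p{Hebrew}/) }
  100.0 * hebrew / letters.length
end

# Hypothetical sample: four Hebrew letters followed by "abc".
puts hebrew_share_utf8("\u05E9\u05DC\u05D5\u05DD abc").round(1)
```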

Thanks,
--
Eliezer Croitoru
http://www1.ngtech.co.il
