[Scikit-learn-general] Cleaning/feature extraction of e-mail messages

Florian Lindner Sun, 24 Nov 2013 03:56:58 -0800

Hello,

I want to use scikit-learn for mail classification (not spam detection). I 
haven't really worked with machine learning software (besides end-user 
spam filters).


What I have done so far:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(input='filename', preprocessor=mail_preprocessor,
                             decode_error="ignore")
X = vectorizer.fit_transform(["testmail2"])

testmail2 is a raw email message (taken from a server's maildir). I set 
decode_error because of UTF-8 decoding issues that I decided to ignore for 
the time being.

This works perfectly for the scikit-learn part. But one challenge (for me) 
seems to be to prepare the mail for feature extraction.

My idea would be to take the text/plain parts of the mails, maybe additionally 
the From header.

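import email

def mail_preprocessor(raw):
    # Parse the raw message text and keep only the text/plain parts.
    msg = email.message_from_string(raw)
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes on Python 3
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    # Lowercase and flatten newlines and tabs to spaces.
    msg_body = "".join(parts).lower()
    return msg_body.replace("\n", " ").replace("\t", " ")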
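
A minimal sketch of the idea above (text/plain parts plus the From header), 
assuming Python 3 and only the standard-library email package. The function 
name mail_to_text is hypothetical, and reading the file as bytes is an 
assumption made here to sidestep the decoding issues mentioned earlier:

```python
import email
from email import policy


def mail_to_text(raw_bytes):
    """Return lowercase text built from the From header and text/plain parts."""
    # policy.default yields EmailMessage objects whose get_content()
    # decodes transfer encodings and charsets for text parts.
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    pieces = []

    sender = msg.get("From")
    if sender:
        pieces.append(str(sender))

    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            pieces.append(part.get_content())

    # Lowercase and collapse all whitespace runs (newlines, tabs) to one space.
    return " ".join(" ".join(pieces).lower().split())
```

Strings produced this way could be passed to TfidfVectorizer with its 
default input='content' instead of input='filename'.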

I know that this may be slightly off-topic, and I apologize if it's too far off.

Is there already some code in the wild that prepares mail messages for feature 
extraction? The topic seems to be much more involved than I had expected, 
given issues like HTML, MIME encodings, multipart messages, ...
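
For the HTML issue specifically, one possible approach is to reduce text/html 
parts to their visible text before vectorizing. The following is only a rough 
sketch using the standard-library html.parser module; the names 
_TextExtractor and html_to_text are hypothetical, and real-world mail HTML 
may need a more robust parser:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```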

Thanks!

Florian
