Re: [Scikit-learn-general] Cleaning/feature extraction of e-mail messages

Tadej Stajner Mon, 25 Nov 2013 04:01:26 -0800

Hi,

Python has the built-in email package, which could be useful for you, at least for the multipart handling and the metadata.

http://docs.python.org/2/library/email-examples.html
http://docs.python.org/3/library/email-examples.html

How to construct features depends on what you need to do. When I was working on mining e-mail data, the important part was handling message threads and resolving replies, and also taking into account sender/recipient network features.
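
As a rough illustration of the thread and sender/recipient information mentioned above, here is a minimal sketch that pulls the relevant headers out of one raw message with the standard library. The function name `header_features` and the returned dictionary layout are hypothetical, chosen only for this example; it assumes the message carries the usual `Message-ID`, `In-Reply-To` and `References` headers, which not every client sets.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def header_features(raw_message):
    """Extract sender, recipients and thread-related IDs from one raw message.

    Hypothetical helper for illustration; missing headers yield empty values.
    """
    msg = message_from_string(raw_message)

    # parseaddr returns (display name, address); keep only the address.
    sender = parseaddr(msg.get("From", ""))[1].lower()

    # Combine To and Cc, then split them into individual addresses.
    recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = [addr.lower() for _, addr in getaddresses(recipient_headers) if addr]

    return {
        "message_id": msg.get("Message-ID", "").strip(),
        # In-Reply-To names the direct parent; References lists the ancestors.
        "in_reply_to": msg.get("In-Reply-To", "").strip(),
        "references": msg.get("References", "").split(),
        "sender": sender,
        "recipients": recipients,
    }
```

Linking each `in_reply_to` value to the matching `message_id` gives the reply structure of a thread, and the sender/recipient pairs can serve as edges of a communication graph.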


-- Tadej

On 11/25/2013 12:33 PM, abhishek wrote:

A simple way of cleaning the HTML tags is using NLTK's "clean_html".
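
For readers without NLTK at hand (and as far as I know, later NLTK releases no longer implement `clean_html` and point to BeautifulSoup instead), tag stripping can be sketched with the standard library alone. This is only an approximation under simple assumptions, not a replacement for a real HTML library; the names `_TextExtractor` and `strip_html` are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML fragment as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining the chunks with a space keeps words in neighbouring tags apart, at the cost of splitting a word that has inline markup in the middle of it.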

On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler<jaquesgrob...@gmail.com <mailto:jaquesgrob...@gmail.com>> wrote:


    Hey Florian,

    So you need some lexical analyzer to remove all the HTML tags etc.
    before you start your classification?
    I'm not sure about any ready-to-use packages for this (I'm sure
    they're out there),
    but I've played around with Python's `re` module at some point and
    found this, which might be useful if you want to write
    your own lexical analyzer for your purposes.

    http://www.gooli.org/blog/a-simple-lexer-in-python/

    Anyway I hope this is helpful in some way.

    Good luck and kind Regards,
    Jaq


    2013/11/24 Florian Lindner <mailingli...@xgm.de
    <mailto:mailingli...@xgm.de>>

        Hello,

        I want to use scikit-learn for mail classification (not spam
        detection). I haven't really worked with machine learning
        software (besides end-user spam filters).

        What I have done so far:

        from sklearn.feature_extraction.text import TfidfVectorizer

        vectorizer = TfidfVectorizer(input='filename',
                                     preprocessor=mail_preprocessor,
                                     decode_error="ignore")
        X = vectorizer.fit_transform(["testmail2"])

        testmail2 is a raw email message (taken from a server's maildir).
        I set decode_error because of UTF-8 decoding issues that I
        decided to ignore for the time being.

        This works perfectly for the scikit-learn part. But one
        challenge (for me) seems to be preparing the mail for feature
        extraction.

        My idea would be to take the text/plain parts of the mails,
        maybe additionally the From header.

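        import email

        def mail_preprocessor(raw):
            # Parse the raw message and collect every text/plain part.
            msg = email.message_from_string(raw)
            msg_body = ""
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    # decode=True undoes base64/quoted-printable and yields bytes,
                    # so decode them with the part's charset before appending.
                    payload = part.get_payload(decode=True)
                    charset = part.get_content_charset() or "utf-8"
                    msg_body += payload.decode(charset, errors="ignore")
            # Normalize case and turn newlines and tabs into spaces.
            msg_body = msg_body.lower()
            msg_body = msg_body.replace("\n", " ")
            msg_body = msg_body.replace("\t", " ")
            return msg_body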

        I know that this may be slightly offtopic and I apologize if
        it's too offtopic.

        Is there already some code in the wild that prepares mail
        messages for feature extraction? The topic seems to be much
        more involved than I had suspected, given issues like HTML,
        MIME encodings, multipart messages, ...

        Thanks!

        Florian

        
------------------------------------------------------------------------------
        Shape the Mobile Experience: Free Subscription
        Software experts and developers: Be at the forefront of tech
        innovation.
        Intel(R) Software Adrenaline delivers strategic insight and
        game-changing
        conversations that shape the rapidly evolving mobile
        landscape. Sign up now.
        
http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
        _______________________________________________
        Scikit-learn-general mailing list
        Scikit-learn-general@lists.sourceforge.net
        <mailto:Scikit-learn-general@lists.sourceforge.net>
        https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



    




--
Regards

Abhishek Thakur



