Hi,
Python has the built-in email package, which could be useful for you, at
least for the multipart handling and the metadata.
http://docs.python.org/2/library/email-examples.html
http://docs.python.org/3/library/email-examples.html
On how to construct features, it depends on what you need to do - when I
was working with mining e-mail data, the important part was handling
message threads and resolving replies, and also taking into account
sender/recipient network features.
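To make the thread and sender/recipient idea concrete, here is a minimal sketch that uses only the standard library. The function name thread_features and the sample message are made up for illustration; the header names (Message-ID, In-Reply-To, References) are the standard ones, but real mail often omits or mangles them, so treat this as a starting point.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def thread_features(raw_message):
    """Collect headers that help with threading and sender/recipient features."""
    msg = message_from_string(raw_message)
    # In-Reply-To and References carry the Message-IDs of earlier messages.
    parents = (msg.get("References", "") + " " + msg.get("In-Reply-To", "")).split()
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "message_id": msg.get("Message-ID", "").strip(),
        "parents": sorted(set(parents)),
        "sender": parseaddr(msg.get("From", ""))[1].lower(),
        "recipients": sorted(addr.lower() for _, addr in recipients if addr),
    }


# Hypothetical reply message, used only to exercise the function.
example = (
    "From: Alice <alice@example.org>\n"
    "To: Bob <bob@example.org>, carol@example.org\n"
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "References: <1@example.org>\n"
    "Subject: Re: hello\n"
    "\n"
    "body\n"
)
features = thread_features(example)
```

The "parents" entries can be used to link a reply to earlier messages, and the (sender, recipient) pairs can serve as edges of a communication graph.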
-- Tadej
On 11/25/2013 12:33 PM, abhishek wrote:
A simple way of cleaning the HTML tags is to use NLTK's "clean_html".
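If NLTK is not an option, tag stripping can also be sketched with the standard library's html.parser. This is not what clean_html does internally; the class TagStripper and the function strip_html are names invented for this example, and it assumes Python 3.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content and skip everything inside <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def strip_html(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    # Join the text pieces and collapse all runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


text = strip_html(
    "<html><style>p {color: red}</style><p>Hello &amp; <b>welcome</b></p></html>"
)
```

Note that joining the pieces with a space will split a word that has inline markup in the middle of it, which is usually acceptable for bag-of-words features.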
On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler
<jaquesgrob...@gmail.com> wrote:
Hey Florian,
So you need some lexical analyzer to remove all the HTML tags etc.
before you start your classification?
I'm not sure about any ready-to-use packages for this (I'm sure
they're out there),
but I've played around with Python's `re` module at some point and
have now found this, which might be useful to you if you want to write
your own lexical analyzer for your purposes.
http://www.gooli.org/blog/a-simple-lexer-in-python/
Anyway I hope this is helpful in some way.
Good luck and kind Regards,
Jaq
2013/11/24 Florian Lindner <mailingli...@xgm.de>
Hello,
I want to use scikit-learn for mail classification (not spam
detection). I haven't really worked with machine learning software
(besides end-user spam filters).
What I have done so far:
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(input='filename',
                                 preprocessor=mail_preprocessor,
                                 decode_error="ignore")
    X = vectorizer.fit_transform(["testmail2"])
testmail2 is a raw email message (taken from a server's maildir). I set
decode_error because of UTF-8 decoding issues that I decided to ignore
for the time being.
This works perfectly for the scikit-learn part. But one challenge (for
me) seems to be preparing the mail for feature extraction.
My idea would be to take the text/plain parts of the mails, and maybe
additionally the From header.
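    import email

    def mail_preprocessor(raw_message):
        msg = email.message_from_string(raw_message)
        msg_body = ""
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                # get_payload(decode=True) returns bytes on Python 3, so
                # decode it with the charset the part declares (if any).
                charset = part.get_content_charset() or "utf-8"
                msg_body += part.get_payload(decode=True).decode(charset, "ignore")
        msg_body = msg_body.lower()
        msg_body = msg_body.replace("\n", " ")
        msg_body = msg_body.replace("\t", " ")
        return msg_body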
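As a possible variant of the idea above (text/plain body plus the From header), here is a sketch that relies on the newer email API in Python 3, where policy.default makes the parser return EmailMessage objects with get_body() and get_content(). The function name mail_to_text and the sample message are invented for this example, and it only looks at the first text/plain body it finds.

```python
from email import message_from_string, policy


def mail_to_text(raw_message):
    """Return the lowercased From header and plain-text body as one string."""
    msg = message_from_string(raw_message, policy=policy.default)
    # get_body() picks the best matching body part; None if there is none.
    body = msg.get_body(preferencelist=("plain",))
    # get_content() undoes the transfer encoding and decodes the charset.
    text = body.get_content() if body is not None else ""
    sender = str(msg.get("From", ""))
    # split()/join collapses newlines, tabs and repeated spaces.
    return " ".join((sender + " " + text).lower().split())


# Hypothetical quoted-printable message, used only to exercise the function.
example = (
    "From: Alice <alice@example.org>\n"
    "Subject: Test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Gr=C3=BC=C3=9Fe\taus\n"
    "Berlin\n"
)
result = mail_to_text(example)
```

A function like this could be passed as the preprocessor argument of the vectorizer; whether prepending the sender actually helps the classifier is something to check on real data.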
I know that this may be slightly off-topic, and I apologize if
it's too off-topic.
Is there already some code in the wild that prepares mail
messages for feature extraction? The topic seems to be much more
involved than I had suspected, regarding issues like HTML, MIME
encodings, multipart messages, ...
Thanks!
Florian
------------------------------------------------------------------------------
Shape the Mobile Experience: Free Subscription
Software experts and developers: Be at the forefront of tech
innovation.
Intel(R) Software Adrenaline delivers strategic insight and
game-changing
conversations that shape the rapidly evolving mobile
landscape. Sign up now.
http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
<mailto:Scikit-learn-general@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Regards
Abhishek Thakur