Re: [Scikit-learn-general] Cleaning/feature extraction of e-mail messages

2013-11-25 Thread Florian Lindner
On Monday, 25 November 2013, 12:33:25, abhishek wrote: > a simple way of cleaning the html tags is using NLTK's "clean_html" Hey, thx, didn't know about that. Just for information: this is now done by BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text It will so

Re: [Scikit-learn-general] Cleaning/feature extraction of e-mail messages

2013-11-25 Thread Tadej Stajner
Hi, Python has the built-in email package which could be useful for you at least for the multipart stuff and the metadata. http://docs.python.org/2/library/email-examples.html http://docs.python.org/3/library/email-examples.html On how to construct features, it depends on what you need to do -
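A minimal sketch of what the built-in email package gives you for the multipart stuff and the metadata. The sample message, the boundary string and the helper name `split_mail` are made up for illustration; only the standard-library calls are real.

```python
from email import message_from_string, policy

# Hypothetical sample message, only for illustration.
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: Meeting notes
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"

Plain body text.
--XYZ
Content-Type: text/html; charset="utf-8"

<html><body><p>HTML body text.</p></body></html>
--XYZ--
"""


def split_mail(raw):
    """Return (headers, parts): a few header values and the decoded
    leaf parts grouped by content type."""
    msg = message_from_string(raw, policy=policy.default)
    headers = {
        "from": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
    }
    parts = {}
    for part in msg.walk():
        if part.is_multipart():
            continue  # container node, the leaves carry the content
        if part.get_content_maintype() != "text":
            continue  # skip attachments such as images in this sketch
        parts.setdefault(part.get_content_type(), []).append(part.get_content())
    return headers, parts


headers, parts = split_mail(RAW)
```

The header values could become separate features (sender, subject), while the text/plain part is usually the easiest input for a text vectorizer.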

Re: [Scikit-learn-general] Cleaning/feature extraction of e-mail messages

2013-11-25 Thread Jaques Grobler
@Florian - Abhishek's suggestion is the way to go. Simple and works well. 2013/11/25 abhishek > a simple way of cleaning the html tags is using NLTK's "clean_html" > > > On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler > wrote: > >> Hey Florian, >> >> So you need some lexical analyzer to re

Re: [Scikit-learn-general] Cleaning/feature extraction of e-mail messages

2013-11-25 Thread abhishek
a simple way of cleaning the html tags is using NLTK's "clean_html" On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler wrote: > Hey Florian, > > So you need some lexical analyzer to remove all the HTML tags etc before > you start your classification? > I'm not sure about any ready-to-use packages

Re: [Scikit-learn-general] Cleaning/feature extraction of e-mail messages

2013-11-25 Thread Jaques Grobler
Hey Florian, So you need some lexical analyzer to remove all the HTML tags etc before you start your classification? I'm not sure about any ready-to-use packages for this (I'm sure they're out there), but I've played around with Python's `re` module at some point and now found this which might be u
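For reference, a rough sketch of what tag removal with the `re` module could look like. This is not the snippet referred to above; the function name `strip_tags` and the patterns are assumptions for illustration. Regular expressions are a crude tool for HTML and will trip over malformed markup, so treat this as a quick first pass rather than a real parser.

```python
import re
from html import unescape

# Drop <script>/<style> elements together with their content.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Any remaining tag: "<", then everything up to the next ">".
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def strip_tags(html):
    """Return the visible text of an HTML string (crude, regex based)."""
    text = SCRIPT_STYLE_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    text = unescape(text)  # turn entities such as &amp; into characters
    return WS_RE.sub(" ", text).strip()


cleaned = strip_tags("<p>Hello <b>world</b> &amp; all</p>")
```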

[Scikit-learn-general] Cleaning/feature extraction of e-mail messages

2013-11-24 Thread Florian Lindner
Hello, I want to use scikit-learn for mail classification (no spam detection). I haven't really worked with machine learning software (besides end-user spam filters). What I have done so far: vectorizer = TfidfVectorizer(input='filename', preprocessor=mail_preprocessor, decode_error="ignore") X
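The body of `mail_preprocessor` is not shown in the post, so the following is only a guess at what such a callable might do, using the standard library alone: parse the raw message, prefer the text/plain body, fall back to HTML with the tags stripped, and prepend the subject. With input='filename' the vectorizer reads and decodes the file itself, so the preprocessor receives the whole message as one string.

```python
import re
from email import message_from_string, policy

TAG_RE = re.compile(r"<[^>]+>")


def mail_preprocessor(raw):
    """Hypothetical preprocessor: raw message text in, cleaned text out."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text body, fall back to the HTML body.
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    if body is not None and body.get_content_type() == "text/html":
        text = TAG_RE.sub(" ", text)  # crude tag removal
    subject = str(msg.get("Subject", ""))
    # A custom preprocessor replaces the vectorizer's default
    # lowercasing step, so lowercase here.
    return (subject + "\n" + text).lower()


plain_out = mail_preprocessor("Subject: Hello\n\nSome BODY text\n")
html_out = mail_preprocessor(
    "Subject: Hi\nContent-Type: text/html\n\n<p>Big <b>News</b></p>\n"
)
```

A function like this can be passed as preprocessor=mail_preprocessor exactly as in the call above; tokenizing and TF-IDF weighting still happen inside the vectorizer.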