On Monday, 25 November 2013, 12:33:25, abhishek wrote:
> a simple way of cleaning the html tags is using NLTK's "clean_html"
Hey,
thx, didn't know about that.
Just for information: this can now be done with BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
It will so
Hi,
Python has the built-in email package which could be useful for you at
least for the multipart stuff and the metadata.
http://docs.python.org/2/library/email-examples.html
http://docs.python.org/3/library/email-examples.html
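
To illustrate the multipart and metadata handling, here is a minimal sketch using only the standard-library email package. The sample message, the split_mail helper and the dictionary keys are made up for illustration; real mails will likely need more care (attachments, missing headers, odd charsets).

```python
from email import message_from_string

# Hypothetical sample mail, only for illustration.
RAW_MAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Test message
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"

Hello Bob, plain text here.
--XYZ
Content-Type: text/html; charset="utf-8"

<html><body><p>Hello <b>Bob</b>, HTML here.</p></body></html>
--XYZ--
"""


def split_mail(raw):
    """Return (metadata, bodies) for a raw mail string.

    metadata holds a few headers; bodies maps content type to decoded text.
    """
    msg = message_from_string(raw)
    metadata = {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
    }
    bodies = {}
    for part in msg.walk():
        # Container parts carry no text of their own, so skip them.
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        charset = part.get_content_charset() or "utf-8"
        bodies[part.get_content_type()] = payload.decode(charset, errors="ignore")
    return metadata, bodies


metadata, bodies = split_mail(RAW_MAIL)
print(metadata["subject"])
print(sorted(bodies))
```

The headers could become separate features (sender, subject), while the text/plain or cleaned text/html body goes to the vectorizer.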
On how to construct features, it depends on what you need to do -
@Florian - Abhishek's suggestion is the way to go. Simple and works well.
2013/11/25 abhishek
> a simple way of cleaning the html tags is using NLTK's "clean_html"
>
>
> On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler
> wrote:
>
>> Hey Florian,
>>
>> So you need some lexical analyzer to re
a simple way of cleaning the html tags is using NLTK's "clean_html"
On Mon, Nov 25, 2013 at 12:30 PM, Jaques Grobler wrote:
> Hey Florian,
>
> So you need some lexical analyzer to remove all the HTML tags etc before
> you start your classification?
> I'm not sure about any ready-to-use packages
Hey Florian,
So you need some lexical analyzer to remove all the HTML tags etc before
you start your classification?
I'm not sure about any ready-to-use packages for this (I'm sure they're out
there),
but I've played around with Python's `re` module at some point and now found
this which might be u
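
For reference, a rough sketch of what tag stripping with the `re` module could look like. This is not the resource mentioned above, just an illustration; the strip_tags name and the patterns are assumptions, and a regex approach is crude: it will trip over malformed markup, comments, or a ">" inside attribute values. A real HTML parser is the more robust choice.

```python
import re
from html import unescape

# Drop whole <script>/<style> elements first, since their content is not text.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Then anything that looks like a tag.
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def strip_tags(markup):
    """Return the markup with tags removed and whitespace collapsed."""
    text = SCRIPT_STYLE_RE.sub(" ", markup)
    text = TAG_RE.sub(" ", text)
    text = unescape(text)  # turn entities such as &amp; back into characters
    return WHITESPACE_RE.sub(" ", text).strip()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello &amp; <b>welcome</b></p></body></html>"
)
cleaned = strip_tags(sample)
print(cleaned)
```

A function like this could serve as the preprocessing step before the text reaches the classifier.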
Hello,
I want to use scikit-learn for mail classification (not spam detection). I
haven't really worked with machine learning software before (besides end-user
spam filters).
What I have done so far:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(input='filename', preprocessor=mail_preprocessor,
                             decode_error="ignore")
X