Re: [Scikit-learn-general] Improving Text Classification

2013-07-12 Thread Nigel Legg
I'm coming at this from a market research point of view (that's my background). There seem to be a number of opportunities there for classification, clustering, and regression analysis tools, so I am building - or rather attempting to build - tools with the aim that they will go on the web, and peo

Re: [Scikit-learn-general] Improving Text Classification

2013-07-12 Thread Ian Ozsvald
Hi Nigel. I see you're in the UK, I'm based east of you in London. My goal with the disambiguator is to provide a well documented pipeline such that it can be easily retrained. I have a notion that in the future I'll host a version of my code production-ready under my http://annotate.io/ , ready f

Re: [Scikit-learn-general] Improving Text Classification

2013-07-12 Thread Ian Ozsvald
Hi Harold. Are you using different models for the different types of social media? I'd guess that the grammar/terms used in a tweet could look quite different to what you see in e.g. a Google+ Comment (different demographic->probably higher quality English, less space restrictions->longer/clearer w

Re: [Scikit-learn-general] Improving Text Classification

2013-07-11 Thread Nigel Legg
I am just starting down the road towards having a text classifier for social media posts. As this may be used in a variety of situations (currently negotiating 2 freelance analytics positions with research agencies), the classifier will need to have a mechanism for retraining on a project by projec

Re: [Scikit-learn-general] Improving Text Classification

2013-07-11 Thread Harold Nguyen
Hi Ian, Thank you very much for writing this message, and especially for sharing your experience. I am actually doing the very same thing, and would love to collaborate with you, if possible. I'm not as far along in my journey as you are, but I hope we can help each other in the future! I'm categ

Re: [Scikit-learn-general] Improving Text Classification

2013-07-11 Thread Ian Ozsvald
Hello Mike. Could you give a summary of your problem? It sounds like you're categorising text (tweets? medical text? news articles?) into >2 categories (how many?), is that right? Is the goal really to optimise your f1 score, or maybe to only want accurate categorisations (precision) or maybe high

Re: [Scikit-learn-general] Improving Text Classification

2013-07-10 Thread Olivier Grisel
2013/7/10 Mike Hansen:
> I have been using Scikit's text classification for several weeks, and I
> really like it. I use my own corpus (self-generated) and prepare each
> document using the NLTK. Presently I am relying on this tutorial/code-base,
> only making changes when absolutely necessary f

Re: [Scikit-learn-general] Improving Text Classification

2013-07-10 Thread Mike Hansen
I have been using Scikit's text classification for several weeks, and I really like it. I use my own corpus (self-generated) and prepare each document using the NLTK. Presently I am relying on this tutorial/code-base, only making changes when absolutely necessary for my documents to work. The