2013/7/6 Tom Fawcett <[email protected]>:
> Hi. I’m trying to figure out a good general framework for working with text
> (classification and clustering). There is an odd intersection of Python
> packages and no clear way to integrate them optimally:
>
> - NLTK seems like the best at handling natural language.
> - sklearn has the strongest components of learning and evaluation.
> - Pandas is very good for data storage, transformation, and visualization.
>
> Each can do a little of what the others can do, and some integrations exist
> (pandas and sklearn both use numpy arrays so they’re pretty compatible), but
> it seems like there’s no clear, good way to integrate them. It’s very common
> to want to go from raw text to stemming and n-grams, term frequencies, and
> finally to TFIDF matrices for learning. But from my searching, people either
> stay in one package or write ad hoc glue code to transform the data.
>
> My question: Is there any interface package, or best practices documentation,
> for using them together to do large-scale text processing? I can write my
> own glue code if I have to, but I’d rather not reinvent the wheel.

There is a very useful DataFrameMapper class for mapping feature extractors
to the heterogeneous columns of a pandas DataFrame (dates, categorical
variables, unscaled continuous variables) so as to build a single homogeneous
floating-point NumPy array suitable for scikit-learn:

  https://github.com/paulgb/sklearn-pandas

However, I am not sure this is really useful when working with text data. In
that case the source data is often more naturally expressed as a list of
Python unicode strings (if the data fits in memory) or as a list of text file
names to be read by the text vectorizer on the fly (use input='filename').

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
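
To make the two input styles mentioned in the reply concrete, here is a minimal sketch that feeds a text vectorizer first a list of in-memory strings and then a list of file names with input='filename'. The choice of TfidfVectorizer, the ngram_range setting, the toy corpus, and the temporary files are all assumptions made for illustration and do not come from the thread; stemming (for example with NLTK) is left out.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, made up for illustration.
docs = [
    u"The cat sat on the mat.",
    u"The dog chased the cat.",
    u"Dogs and cats make good pets.",
]

# Case 1: the corpus fits in memory as a list of unicode strings.
# ngram_range=(1, 2) extracts unigrams and bigrams.
vec_mem = TfidfVectorizer(ngram_range=(1, 2))
X_mem = vec_mem.fit_transform(docs)  # sparse matrix, one row per document

# Case 2: the corpus lives on disk. Write the toy documents to
# temporary files, then pass only the file names; the vectorizer
# opens and reads each file itself because input='filename'.
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(docs):
    path = os.path.join(tmpdir, "doc%d.txt" % i)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

vec_files = TfidfVectorizer(input="filename", ngram_range=(1, 2))
X_files = vec_files.fit_transform(paths)
```

Both calls should produce the same TF-IDF matrix, which can then be passed to the fit method of a scikit-learn estimator; the file-name variant only avoids holding the raw text of the whole corpus in memory at once.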
