2013/7/6 Tom Fawcett <[email protected]>:
> Hi.  I’m trying to figure out a good general framework for working with text 
> (classification and clustering).  There is an odd intersection of Python 
> packages and no clear way to integrate them optimally:
>
> - NLTK seems like the best at handling natural language.
> - sklearn has the strongest components of learning and evaluation.
> - Pandas is very good for data storage, transformation, and visualization.
>
> Each can do a little of what the others can do, and some integrations exist 
> (pandas and sklearn both use numpy arrays so they’re pretty compatible), but 
> it seems like there’s no clear, good way to integrate them.  It’s very common 
> to want to go from raw text to stemming and n-grams, term frequencies, and 
> finally to TFIDF matrices for learning.  But from my searching, people either 
> stay in one package or write ad hoc glue code to transform the data.
>
> My question: Is there any interface package, or best practices documentation, 
> for using them together to do large-scale text processing?  I can write my 
> own glue code if I have to, but I’d rather not reinvent the wheel.

There is a very useful DataFrameMapper class that maps feature extractors
onto the heterogeneous columns of a pandas data frame (dates, categorical
variables, unscaled continuous variables) so as to build a single
homogeneous floating-point numpy array suitable for
scikit-learn:

https://github.com/paulgb/sklearn-pandas

However, I am not sure it is really useful when working with
text data. In that case the source data is often more naturally
expressed as a list of Python unicode strings (if the data fits in
memory) or a list of text file names to be read by the text
vectorizer on the fly (use input='filename').
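
To make the two input styles concrete, here is a small sketch using
scikit-learn's TfidfVectorizer. The three sample sentences, the temporary
files, and the ngram_range setting are assumptions chosen for illustration
only; a real corpus would of course be larger.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats are pets.",
]

# Option 1: the corpus fits in memory, so pass the strings directly.
vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X_mem = vec.fit_transform(docs)

# Option 2: pass file names and let the vectorizer read each file itself.
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(docs):
    path = os.path.join(tmpdir, "doc%d.txt" % i)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

file_vec = TfidfVectorizer(input="filename", encoding="utf-8",
                           ngram_range=(1, 2))
X_files = file_vec.fit_transform(paths)

# Both routes produce a sparse TF-IDF matrix with one row per document.
print(X_mem.shape, X_files.shape)
```

If stemming is needed, the vectorizer's tokenizer parameter accepts a
callable, so a stemmer such as one from NLTK could be plugged in there.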

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general