Hi.  I’m trying to find a good general framework for working with text 
(classification and clustering).  Several Python packages overlap in this 
area, but there is no obvious way to combine them:

- NLTK seems like the best at handling natural language.
- sklearn has the strongest components for learning and evaluation.
- Pandas is very good for data storage, transformation, and visualization.

Each can do a little of what the others can do, and some integrations exist 
(pandas and sklearn both use numpy arrays, so they’re fairly compatible), but 
I haven’t found a standard way to chain them.  It’s very common to want to go 
from raw text to stemming and n-grams, then term frequencies, and finally to 
TF-IDF matrices for learning.  But from my searching, people either stay in 
one package or write ad hoc glue code to transform the data.

My question: Is there any interface package, or best practices documentation, 
for using them together to do large-scale text processing?  I can write my own 
glue code if I have to, but I’d rather not reinvent the wheel.

Thanks,
-Tom


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
