Hi. I’m trying to figure out a good general framework for working with text (classification and clustering). There is an odd intersection of Python packages and no clear way to integrate them optimally:
- NLTK seems like the best at handling natural language.
- sklearn has the strongest learning and evaluation components.
- Pandas is very good for data storage, transformation, and visualization.

Each can do a little of what the others can do, and some integrations exist (pandas and sklearn both build on numpy arrays, so they're fairly compatible), but there seems to be no clear, good way to integrate all three. It's very common to want to go from raw text through stemming and n-grams to term frequencies, and finally to a TF-IDF matrix for learning. But from my searching, people either stay within one package or write ad hoc glue code to transform the data between them.

My question: is there any interface package, or best-practices documentation, for using them together to do large-scale text processing? I can write my own glue code if I have to, but I'd rather not reinvent the wheel.

Thanks,
-Tom
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
