Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

Joel Nothman Sat, 06 Jul 2013 14:07:03 -0700

Sorry, that sent prematurely.

On Sun, Jul 7, 2013 at 6:58 AM, Joel Nothman
<[email protected]> wrote:

> I am not aware of a definitive, complete solution. Lars has built an
> NLTK-compatible classifier interface in nltk.classify.scikitlearn, while
> scikit-learn provides the various components in sklearn.feature_extraction
> that handle text directly, or would allow you to readily produce arrays
> from feature dicts.
>

I don't think there's any clear, generic way for them to interface better:
both systems prefer to interface with native types (dicts, numpy arrays)
rather than sophisticated framework components. (But I'm also not convinced
that NLTK is the right tool for a lot of large-scale feature extraction
jobs.)
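
A minimal sketch of the "native types" route, assuming NLTK-style feature
dicts on one side and sklearn.feature_extraction.DictVectorizer on the
other. The extractor word_features and the two toy documents are
hypothetical, made up purely for illustration:

```python
from sklearn.feature_extraction import DictVectorizer


def word_features(tokens):
    """Hypothetical NLTK-style feature extractor: token list -> plain dict."""
    feats = {}
    for tok in tokens:
        key = "contains(%s)" % tok.lower()
        feats[key] = feats.get(key, 0) + 1  # simple per-document count
    return feats


# Toy, already-tokenized documents (assumed data, not from the thread).
docs = [
    ["Cheap", "pills", "online"],
    ["Meeting", "agenda", "online"],
]

feature_dicts = [word_features(d) for d in docs]

# DictVectorizer maps a list of dicts to a 2-D array, one column per key.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(feature_dicts)

# vec.vocabulary_ maps each feature name to its column index in X.
online_col = vec.vocabulary_["contains(online)"]
```

The same list of dicts is what an NLTK classifier would be trained on, so
the dict is the only thing the two libraries need to agree on.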

I also don't know what data you want to analyse in Pandas: the feature
data? the classification results?

Because each of these packages tries to stay focused on a single purpose,
and therefore independent of the others, the only things that really tie
them together are occasional blog posts and PyCon tutorials from the likes
of Olivier. Frustratingly, something like
http://www.slideshare.net/ogrisel/statistical-machine-learning-for-text-classification-with-scikitlearn-and-nltk
is rapidly outdated.

I think it would be in scikit-learn's best interests to provide up-to-date
examples of both of these interactions (with NLTK and with Pandas),
although that means maintaining an examples package with more external
dependencies.
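
A rough sketch of what such an example might cover, assuming the path from
raw text through stemming and n-grams to a TF-IDF matrix, viewed in a
Pandas DataFrame. crude_stem is a hypothetical stand-in for a real stemmer
(for instance one from nltk.stem), and the three texts are invented:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def crude_stem(token):
    """Hypothetical suffix stripper; a real stemmer would go here."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Whitespace tokenizer followed by stemming."""
    return [crude_stem(tok) for tok in text.lower().split()]


texts = ["cats chasing dogs", "dogs chasing cars", "cars need fuel"]

# A custom tokenizer replaces the default regexp; token_pattern=None
# signals that the pattern is intentionally unused.
vec = TfidfVectorizer(tokenizer=tokenize, token_pattern=None,
                      ngram_range=(1, 2))  # unigrams and bigrams
X = vec.fit_transform(texts)  # sparse TF-IDF matrix, rows L2-normalized

# Order the terms by column index so they line up with the matrix columns.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
df = pd.DataFrame(X.toarray(), columns=terms)
```

Densifying with toarray() is only reasonable for small vocabularies; for
large-scale work the sparse matrix would normally go straight to the
estimator.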

- Joel

> On Sun, Jul 7, 2013 at 2:53 AM, Tom Fawcett <[email protected]> wrote:
>
>> Hi.  I’m trying to figure out a good general framework for working with
>> text (classification and clustering).  There is an odd intersection of
>> Python packages and no clear way to integrate them optimally:
>>
>> - NLTK seems like the best at handling natural language.
>> - sklearn has the strongest components of learning and evaluation.
>> - Pandas is very good for data storage, transformation, and visualization.
>>
>> Each can do a little of what the others can do, and some integrations
>> exist (pandas and sklearn both use numpy arrays so they’re pretty
>> compatible), but it seems like there’s no clear, good way to integrate
>> them.  It’s very common to want to go from raw text to stemming and
>> n-grams, term frequencies, and finally to TFIDF matrices for learning.  But
>> from my searching, people either stay in one package or write ad hoc glue
>> code to transform the data.
>>
>> My question: Is there any interface package, or best practices
>> documentation, for using them together to do large-scale text processing?
>>  I can write my own glue code if I have to, but I’d rather not reinvent the
>> wheel.
>>
>> Thanks,
>> -Tom
>>
>>
>>
>> ------------------------------------------------------------------------------
>> This SF.net email is sponsored by Windows:
>>
>> Build for Windows Store.
>>
>> http://p.sf.net/sfu/windows-dev2dev
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
