Re: [Scikit-learn-general] Random Forest Regression - Large Sparse Data

Eustache DIEMERT Wed, 24 Apr 2013 11:52:51 -0700

Hi Alex,

If I understand correctly, you are using two different kinds of features:
categorical + ngrams.


In a similar situation, but in a classification setting, a trick that worked
reasonably well was to train two different models, one feeding the other.

I.e., build a first model from the ngram/NLP features and pass its
prediction on to a second layer whose inputs are this prediction plus the
other categorical features.

In my case I used Naive Bayes for the NLP features, which is very fast and
quite memory efficient, and an SGD model to integrate the numerical
features (including document lengths, length of best match, etc.).

In your case you could use an SVR on the NLP features and then an RF
regressor as the second layer. From a memory usage perspective, the SVR
accepts sparse matrices, has a configurable kernel cache size and only needs
to keep the support vectors, so it will probably meet your requirements [1].
If you want or need to, you can use the HashingVectorizer [2] to reduce this
part to a fixed number of features. You then pass the SVR output plus your
"other" categorical features to the RF regressor.

I expect this to work fairly well, since the RF regressor only needs to
learn the difference between what the SVR based on the NLP features predicts
and the true output.

If your "other" features are not categorical by nature but rather numeric
(and were cast to categorical to integrate with the NLP features), that
could work even better.
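
A minimal sketch of the two-layer setup described above, on toy data. The
documents, the "other" features, the target and all parameter values are
invented for illustration, the imports assume a recent scikit-learn release
(not the 0.13 docs linked below), and the use of out-of-fold SVR predictions
is my own addition so that the second layer does not train on predictions
the SVR made for samples it was fitted on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

# Toy data, invented for illustration only.
rng = np.random.RandomState(0)
words = ["good", "bad", "fast", "slow", "cheap", "pricey"]
docs = [" ".join(rng.choice(words, size=5)) for _ in range(60)]
other = rng.randint(0, 3, size=(60, 2)).astype(float)  # stand-in "other" features
y = np.array([d.count("good") - d.count("bad") for d in docs], dtype=float)
y = y + other[:, 0]

# Layer 1: SVR on hashed unigram/bigram features (sparse, fixed width).
vec = HashingVectorizer(n_features=2 ** 10, ngram_range=(1, 2))
X_text = vec.transform(docs)
svr = SVR(kernel="linear", cache_size=200)  # cache_size is in MB

# Out-of-fold predictions (assumption: not part of the original suggestion).
svr_pred = cross_val_predict(svr, X_text, y, cv=3)
svr.fit(X_text, y)  # refit on everything for use on new documents

# Layer 2: random forest on [SVR prediction, other features].
X_stage2 = np.column_stack([svr_pred, other])
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_stage2, y)
```

For a new sample, the same chain applies: hash the text, take
svr.predict(...) and stack it with the other features before calling
rf.predict(...).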

Anyway, let us know what you find :)

Eustache

[1] http://scikit-learn.org/0.13/modules/generated/sklearn.svm.SVR.html
[2] http://scikit-learn.org/0.13/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html


2013/4/24 Olivier Grisel <[email protected]>

> 2013/4/24 Alex Kopp <[email protected]>:
> > Thanks, guys.
> >
> > Perhaps I should explain what I am trying to do and then open it up for
> > suggestions.
> >
> > I have 203k training examples each with 457k features. The features are
> > composed of one-hot encoded categorical values as well as stemmed, TFIDF
> > weighted unigrams and bigrams (NLP). As you can probably guess, the
> > overwhelming majority of the features are the unigrams and bigrams.
> >
> > In the end, I am looking to build a regression model. I have tried a grid
> > search on SGDRegressor, but have not had any promising results (~0.00 or
> > even negative R^2 values).
> >
> > I would appreciate ideas/suggestions.
>
> Have you tried plotting the histogram of the target variable? If it's
> highly non-Gaussian (e.g. positive with a long tail), trying to
> predict the log or sqrt might be easier.
>
> Also have you tried a simpler problem such as binary classification:
>
> 1- split your training samples in 3 equal subsets:
>   A: 1/3 of the samples with the biggest outputs,
>   B: 1/3 of the samples with the smallest outputs,
>   C: 1/3 for the remaining samples in the middle.
>
> 2- discard C and train a binary classifier (e.g. gridsearched
> SGDClassifier treating A samples as positive and B samples as
> negative).
>
> If you cannot get past 55% cross-validated accuracy on this problem, it
> probably means that your problem is really hard: either the output
> variable is unrelated to the input or the dependency is highly
> non-linear.
>
> You can also try to do dimensionality reduction by running
> MiniBatchKMeans on the whole dataset with 1000 centroids. Then compute
> the cosine similarity of your samples with those 1000 centroids,
> threshold at zero to get positive values and treat those 1000
> dimensions as new features for your samples.
>
> Then train a random forest on the new features.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> Try New Relic Now & We'll Send You this Cool Shirt
> New Relic is the only SaaS-based application performance monitoring service
> that delivers powerful full stack analytics. Optimize and monitor your
> browser, app, & servers with just a few lines of code. Try New Relic
> and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
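
The binary-classification sanity check from the quoted message could be
sketched roughly as follows. The data here is synthetic and the
SGDClassifier is left at its defaults rather than grid searched, so this
only illustrates the tercile split and the cross-validated accuracy, under
the assumption of a recent scikit-learn release.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic regression data, invented for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(300, 20)
y = 2.0 * X[:, 0] + 0.5 * rng.randn(300)

# Tercile cut points of the target.
lo, hi = np.percentile(y, [100 / 3, 200 / 3])

# Keep A (largest third) and B (smallest third); discard C (the middle).
keep = (y <= lo) | (y >= hi)
X_ab = X[keep]
labels = (y[keep] >= hi).astype(int)  # A -> 1 (positive), B -> 0 (negative)

# Cross-validated accuracy of a linear classifier on A vs. B.
clf = SGDClassifier(random_state=0)
scores = cross_val_score(clf, X_ab, labels, cv=5, scoring="accuracy")
mean_acc = scores.mean()
print("mean CV accuracy: %.3f" % mean_acc)
```

An accuracy close to 50% on the real data would point to the "really hard
problem" case described in the message.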

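The MiniBatchKMeans suggestion from the quoted message could look roughly
like this. The sparse matrix and target are random stand-ins, and the sketch
uses 10 centroids instead of 1000 only to keep the toy example small; all
parameter values are assumptions, as is the use of a recent scikit-learn
release.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics.pairwise import cosine_similarity

# Random sparse stand-in for the TF-IDF matrix, invented for illustration.
rng = np.random.RandomState(0)
X = sparse.random(200, 500, density=0.02, format="csr", random_state=rng)
y = rng.rand(200)

# Cluster the samples (1000 centroids in the message, 10 here).
n_centroids = 10
km = MiniBatchKMeans(n_clusters=n_centroids, n_init=3, random_state=0)
km.fit(X)

# Cosine similarity of every sample to every centroid, thresholded at zero.
sim = cosine_similarity(X, km.cluster_centers_)
X_new = np.maximum(sim, 0.0)

# Dense, low-dimensional features are a much easier input for a forest.
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_new, y)
```

The thresholding is a no-op on non-negative data such as this toy matrix;
it matters when the features can be negative, for instance after centering.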