Have you tried tuning the hyper-parameters of SGDRegressor? You really
need to tune the learning rate for SGDRegressor (SGDClassifier has a pretty
decent default). E.g. set up a grid search with a constant learning rate and
try different values of eta0 ([0.1, 0.01, 0.001, 0.0001]). You can also set
verbose=3 to see the loss after each epoch, which you can use to check
convergence.
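
Something along these lines (untested sketch; X_train and y_train stand in
for your data):

    from sklearn.linear_model import SGDRegressor
    from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

    # grid over the initial learning rate with a constant schedule
    param_grid = {'eta0': [0.1, 0.01, 0.001, 0.0001]}
    sgd = SGDRegressor(learning_rate='constant', n_iter=20, random_state=0)
    search = GridSearchCV(sgd, param_grid, cv=3, n_jobs=8)
    search.fit(X_train, y_train)
    print("best params: %r (R^2 %.3f)" % (search.best_params_, search.best_score_))

    # refit the best setting verbosely to eyeball the per-epoch loss
    best = SGDRegressor(learning_rate='constant', verbose=3, n_iter=20,
                        **search.best_params_)
    best.fit(X_train, y_train)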


2013/4/24 Alex Kopp <[email protected]>

> Thanks, guys.
>
> Perhaps I should explain what I am trying to do and then open it up for
> suggestions.
>
> I have 203k training examples each with 457k features. The features are
> composed of one-hot encoded categorical values as well as stemmed, TFIDF
> weighted unigrams and bigrams (NLP). As you can probably guess, the
> overwhelming majority of the features are the unigrams and bigrams.
>
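> For concreteness, the matrix is built roughly like this (simplified,
> hypothetical variable names):
>
>     import scipy.sparse as sp
>     from sklearn.feature_extraction import DictVectorizer
>     from sklearn.feature_extraction.text import TfidfVectorizer
>
>     # one-hot encode the categorical fields (one dict per example)
>     dv = DictVectorizer()
>     X_cat = dv.fit_transform(categorical_records)
>
>     # TF-IDF weighted unigrams and bigrams over the pre-stemmed text
>     tfidf = TfidfVectorizer(ngram_range=(1, 2))
>     X_text = tfidf.fit_transform(stemmed_documents)
>
>     # stack both sparse blocks side by side -> 203k x ~457k
>     X = sp.hstack([X_cat, X_text]).tocsr()
>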
> In the end, I am looking to build a regression model. I have tried a grid
> search on SGDRegressor, but have not had any promising results (~0.00 or
> even negative R^2 values).
>
> I would appreciate ideas/suggestions.
>
> Thanks
>
> PS: if it matters, I have 8 cores and 52 GB of RAM at my disposal.
>
> On Wed, Apr 24, 2013 at 5:32 AM, Peter Prettenhofer <
> [email protected]> wrote:
>
>>
>>
>>
>> 2013/4/24 Olivier Grisel <[email protected]>
>>
>>> 2013/4/24 Peter Prettenhofer <[email protected]>:
>>> > I totally agree with Brian - although I'd suggest you drop option 3)
>>> > because it will be a lot of work.
>>> >
>>> > I'd rather suggest you do a) feature extraction or b) feature
>>> > selection.
>>> >
>>> > Personally, I think decision trees in general and random forests in
>>> > particular are not a good fit for sparse datasets - if the average
>>> > number of non-zero values per feature is low, then your partitions
>>> > will be relatively small, and any subsequent splits will make the
>>> > partitions even smaller, so you cannot grow your trees deep before
>>> > you run out of samples. This means that your tree in fact uses just a
>>> > tiny fraction of the available features (compared to a deep tree) -
>>> > unless you have a few pretty strong features or you train lots of
>>> > trees, this won't work out. This is probably also the reason why most
>>> > of the decision tree work in natural language processing is done
>>> > using boosted decision trees of depth one. If your features are
>>> > boolean, then such a model is in fact pretty similar to a simple
>>> > logistic regression model.
>>> >
>>> > I have the impression that Random Forest in particular is a poor
>>> > "evidence accumulator" (pooling evidence from lots of weak features)
>>> > - linear models and boosted trees are much better here.
>>>
>>> Very interesting consideration. Any reference paper to study this in
>>> more detail (both theory and empirical validation)?
>>>
>>
>> Actually, no - just a gut feeling based on how decision trees / RF work
>> (hard, non-intersecting partitions) - I will try to dig something up -
>> would definitely like to hear any criticism/remarks on my view though.
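>>
>> One quick way to probe this would be a synthetic check: lots of weak,
>> sparse boolean features, comparing RF, boosted stumps and a linear model
>> (untested sketch):
>>
>>     import numpy as np
>>     from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
>>     from sklearn.linear_model import LogisticRegression
>>     from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases
>>
>>     rng = np.random.RandomState(0)
>>     n, d = 2000, 500
>>     y = rng.randint(2, size=n)
>>     # each feature fires rarely, slightly more often when y == 1, so no
>>     # single feature is informative but the pooled evidence is
>>     p = 0.02 + 0.02 * y[:, np.newaxis]
>>     X = (rng.rand(n, d) < p).astype(np.float64)
>>
>>     for clf in [RandomForestClassifier(n_estimators=100, random_state=0),
>>                 GradientBoostingClassifier(max_depth=1, random_state=0),
>>                 LogisticRegression()]:
>>         scores = cross_val_score(clf, X, y, cv=5)
>>         print("%s: %.3f" % (clf.__class__.__name__, scores.mean()))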
>>
>>
>>>
>>> Also, do you have a good paper that demonstrates state-of-the-art
>>> results with boosted stumps for NLP?
>>>
>>
>> I haven't seen any use of boosted stumps in NLP for a while - but maybe
>> I didn't pay close attention - what comes to mind is some work by Xavier
>> Carreras on NER for CoNLL 2002 (see [1] for an overview of the shared
>> task - actually, a number of participants used boosting/trees).
>> Joseph Turian used boosting in his thesis on parsing [2].
>>
>> [1] http://acl.ldc.upenn.edu/W/W02/W02-2024.pdf
>> [2] http://cs.nyu.edu/web/Research/Theses/turian_joseph.pdf
>>
>>
>>
>>> --
>>> Olivier
>>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>>
>>
>> --
>> Peter Prettenhofer
>
>


-- 
Peter Prettenhofer