Hi, Danny,

I recall that sofia-ml minimizes pairwise ranking errors, so it might
not be the right solution for you if AUC is not your evaluation
criterion. In addition, it only supports linear models; is that enough
for your problem?
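
By the way, the "train-on-errors" loop Ted describes below could be
sketched roughly as follows (hypothetical Python using NumPy and a
scikit-learn-style LogisticRegression; the sample sizes and names are
only illustrative, not an actual Mahout API):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def train_on_errors(X_full, y_full, X_small, y_small, max_rounds=10):
        # a) main-effects model on the ~1M balanced sample
        model = LogisticRegression().fit(X_small, y_small)
        best_auc = 0.0
        for _ in range(max_rounds):
            # b) scan the full dataset, keep the ~1M worst-predicted samples
            p = model.predict_proba(X_full)[:, 1]
            worst = np.argsort(np.abs(y_full - p))[-len(y_small):]
            # c) build a fancier model on the ~2M combined samples
            #    (interaction features could be added here)
            X_train = np.vstack([X_small, X_full[worst]])
            y_train = np.concatenate([y_small, y_full[worst]])
            model = LogisticRegression().fit(X_train, y_train)
            # d) rinse, repeat while AUC on the full data still improves
            auc = roc_auc_score(y_full, model.predict_proba(X_full)[:, 1])
            if auc <= best_auc:
                break
            best_auc = auc
        return model

Of course, at ~300M samples the "scan the full dataset" step would be a
Hadoop/MR job rather than an in-memory scan; the sketch only shows the
control flow.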

2010/5/7, Ted Dunning <[email protected]>:
> Glad to hear that you have made good use of Mahout so far.
>
> My recommendations right now for scalable classifiers are generally in the
> SGD area, the canonical example of which is Vowpal Wabbit.  Another
> benchmark implementation is glmnet, which does Lasso and elastic net
> regularization.  Vowpal Wabbit will definitely scale to the size you are
> talking about, but truly shines on very large feature spaces.  Glmnet is
> very, very good and very efficient, but assumes an in-core implementation
> right now, thus limiting applicability to your problem.
>
> With only 100 features, my guess is that you can train a main-effects model
> with a relatively small subset of your data, particularly if you have an
> asymmetric target.  You can also use the standard "train-on-errors"
> techniques to augment your original sampled dataset so as to still have a
> small training set which captures what you need out of your larger dataset.
>  This might be particularly helpful if you want to train on interactions.
>
> The general procedure there would be to
>
> a) train a main-effects model on about 1M balanced sample
> b) scan your full dataset and retain about 1M samples that have the worst
> errors
> c) build a fancy new model on the 2M samples
> d) rinse, repeat while AUC improves
>
>
> On Thu, May 6, 2010 at 9:15 AM, Danny Leshem <[email protected]> wrote:
>
>> Hi!
>>
>> I'm currently working on a rather large-scale dataset (~300M samples
>> represented as dense vectors of cardinality ~100).
>> The data lives in an EC2 Hadoop cluster and is pre-processed using MR jobs,
>> including heavy usage of Mahout (Lanczos decomposition, clustering, etc.).
>>
>> I'm now looking for ways to learn a logistic regression model based on the
>> data.
>> So far I have postponed this part of the project, hoping for
>> MAHOUT-228 <https://issues.apache.org/jira/browse/MAHOUT-228> to be
>> ready... but unfortunately I can't afford to wait any more :)
>>
>> Looking around, I've found Google's sofia-ml
>> <http://code.google.com/p/sofia-ml/> and some UC Berkeley Hadoop-based
>> implementation
>> <http://berkeley-mltea.pbworks.com/Hadoop-for-Machine-Learning-Guide>.
>> Does anyone have experience with these, or know of / have used a good
>> library for logistic regression at this scale?
>>
>> Thanks,
>> Danny
>>
>


-- 
Blog of Mahout Chen:
http://blog.sina.com.cn/apachemahout
