Hey Helge
Funny, I just saw this drop into my inbox! Hope you are well.
What does your data look like? Is it sparse? For classification tasks
(read: SGDClassifier), you can stream the data through the model one
example at a time and so stay "out-of-core", though in practice I'd
recommend feeding it in "mini-batches" via the .partial_fit() method.
Also potentially interesting: https://github.com/mblondel/lightning
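To make the mini-batch idea concrete, here is a rough sketch of what I mean. The data is synthetic and held in memory purely for illustration, and iter_minibatches is a made-up helper that stands in for whatever reads chunks off your disk; only SGDClassifier and partial_fit are actual sklearn API.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_minibatches(X, y, batch_size):
    """Yield consecutive (X, y) chunks; hypothetical stand-in for reading chunks from disk."""
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


# Synthetic, linearly separable data (an assumption, for illustration only).
rng = np.random.RandomState(0)
X = rng.randn(10000, 20)
true_w = rng.randn(20)
y = (X @ true_w > 0).astype(int)

clf = SGDClassifier(random_state=0)  # default hinge loss, i.e. a linear SVM
classes = np.array([0, 1])  # partial_fit must be told every class label on the first call

# Only one mini-batch needs to be in memory at any time.
for X_batch, y_batch in iter_minibatches(X, y, batch_size=500):
    clf.partial_fit(X_batch, y_batch, classes=classes)

accuracy = clf.score(X, y)
```

For a real out-of-core run you would replace the generator with one that reads successive chunks from your files, and probably make several passes over the data.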
Honestly though, if you're talking 1TB of vector data (how many rows of
data roughly?), you will most likely need to do it in parallel. I'm not
sure of all the options in Python land - but something could be done with
Disco (in mapreduce), or IPython parallel, SciDB-py, or take a look at
Spark (http://spark-project.org/). They have a Python API (
http://spark-project.org/docs/latest/python-programming-guide.html) that
can be used to operate on distributed datasets fairly easily - and there is
HDFS integration (which may or may not be an advantage for you :).
Here is a gist example I worked up a while ago using sklearn to do
distributed logistic regression: https://gist.github.com/MLnick/4707012.
With sklearn your only option here is "iterative parameter mixtures" (each
worker fits a model on its own partition and the parameters are then
averaged), since sklearn doesn't expose the gradients directly via the API.
That rules out "distributed gradient descent", in which the local gradients
are summed and a single gradient step is taken on the master.
Also look into Spark's new machine learning library MLlib, which takes care
of K-Means and logistic regression so far (though not yet for sparse
datasets), and will likely have more models coming soon.
N
On Fri, Aug 23, 2013 at 11:18 AM, [email protected] <
[email protected]> wrote:
> Good day,
>
> Can anyone perhaps give me an idea of how large datasets scikit-learn
> algorithms typically can handle?
>
> I have about 4 TB of structured data. I might be able to normalize that
> down to say 1 TB if necessary. The tasks would typically be logistic
> regression, Naive Bayes, k-Means and possible others.
>
> Will scikit-learn algorithms be able to handle this on a fairly powerful
> hardware setup?
>
> At which point does it become necessary to look at distributed ML
> platforms e.g. Mahout instead?
>
> Best regards,
> Helge
>
>
> ------------------------------------------------------------------------------
> Introducing Performance Central, a new site from SourceForge and
> AppDynamics. Performance Central is your source for news, insights,
> analysis and resources for efficient Application Performance Management.
> Visit us today!
> http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>