Jörn and Nick, thanks for answering.
Nick, the sparkit-learn project looks interesting. Thanks for mentioning it.

Rex

On Sat, Sep 12, 2015 at 12:05 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> You might want to check out https://github.com/lensacom/sparkit-learn
> <https://github.com/lensacom/sparkit-learn/blob/master/README.rst>
>
> Though it's true that for random forests/trees you will need to use MLlib.
>
> On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I fear you have to do the plumbing all yourself. This is the same for all
>> commercial and non-commercial libraries/analytics packages. It often also
>> depends on the functional requirements for how you distribute.
>>
>> On Sat, Sep 12, 2015 at 8:18 PM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> What is the best way to migrate existing scikit-learn code to a PySpark
>>> cluster? Then we could bring together the full power of both scikit-learn
>>> and Spark to do scalable machine learning. (I know we have MLlib, but the
>>> existing code base is big, and some functions are not fully supported yet.)
>>>
>>> Currently I use Python's multiprocessing module to boost the speed, but
>>> that only works on one node, and only while the data set is small.
>>>
>>> For many real cases, we may need to deal with gigabytes or even terabytes
>>> of data, with thousands of raw categorical attributes, which can lead to
>>> millions of discrete features under a 1-of-k representation.
>>>
>>> For these cases, one solution is to use distributed memory. That's why I
>>> am considering Spark. And Spark supports Python! With PySpark, we can
>>> import scikit-learn.
>>>
>>> But the question is how to make the scikit-learn code, a
>>> DecisionTreeClassifier for example, run in distributed mode, to benefit
>>> from the power of Spark?
>>>
>>> Best,
>>> Rex
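
To make the answers above concrete: one common pattern for getting existing scikit-learn code onto a Spark cluster without a rewrite is to distribute the hyperparameter search rather than a single fit, broadcasting the data once and fitting one plain scikit-learn model per Spark task. Below is a minimal sketch of that pattern, not code from this thread; it assumes the training set fits in a single worker's memory, that scikit-learn is installed on every worker, and the dataset and grid values are only illustrative.

    from pyspark import SparkContext
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    # sklearn 0.16-era import; newer releases moved this to sklearn.model_selection
    from sklearn.cross_validation import cross_val_score

    sc = SparkContext(appName="sklearn-on-spark")

    iris = load_iris()
    X_b = sc.broadcast(iris.data)    # ship the data to every worker once
    y_b = sc.broadcast(iris.target)

    # Illustrative grid; any picklable parameter dicts work here.
    param_grid = [{"max_depth": d, "min_samples_leaf": m}
                  for d in (3, 5, 10, None)
                  for m in (1, 5, 10)]

    def evaluate(params):
        # Runs on a worker: fit and score one ordinary scikit-learn model.
        clf = DecisionTreeClassifier(**params)
        return params, cross_val_score(clf, X_b.value, y_b.value, cv=3).mean()

    results = sc.parallelize(param_grid, len(param_grid)).map(evaluate).collect()
    best_params, best_score = max(results, key=lambda r: r[1])

The caveat Jörn and Nick raise still applies: this parallelizes across models, not within one model's training, so for gigabyte- to terabyte-scale training sets you would still need MLlib's distributed trees or a bridge like sparkit-learn. Going from memory of the sparkit-learn README (check the link above for the current API), its entry point wraps a plain RDD so that each block becomes a numpy array, roughly:

    from splearn.rdd import ArrayRDD

    data = sc.parallelize(range(20), 2)  # an ordinary PySpark RDD
    X = ArrayRDD(data, bsize=5)          # blocks of 5 elements, each a numpy array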