Jorn and Nick,

Thanks for answering.

Nick, the sparkit-learn project looks interesting. Thanks for mentioning it.
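
As a rough sketch of what usage might look like, based on the classes shown in
the project's README (the ArrayRDD/DictRDD wrappers plus the Spark* estimator
classes); the exact names and signatures below are untested assumptions taken
from that README, not a verified example:

    import numpy as np
    from splearn.rdd import DictRDD
    from splearn.feature_extraction.text import SparkHashingVectorizer
    from splearn.linear_model import SparkLogisticRegression
    from splearn.pipeline import SparkPipeline

    # `sc` is assumed to be an existing SparkContext (e.g. from a pyspark shell).
    docs = ['spark and sklearn', 'more text data', 'yet another doc', 'one more'] * 25
    labels = [0, 1, 0, 1] * 25

    # Wrap plain RDDs into a blocked, numpy-backed DictRDD of (X, y) pairs.
    X_rdd = sc.parallelize(docs, 4)
    y_rdd = sc.parallelize(labels, 4)
    Z = DictRDD((X_rdd, y_rdd), columns=('X', 'y'), dtype=[np.ndarray, np.ndarray])

    # Distributed counterpart of a scikit-learn Pipeline, per the README.
    pipeline = SparkPipeline([
        ('vect', SparkHashingVectorizer()),
        ('clf', SparkLogisticRegression()),
    ])
    pipeline.fit(Z, clf__classes=np.unique(labels))

The appeal, as I read the README, is that the blocked RDDs expose a
scikit-learn-style fit/transform API while the data stays distributed.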


Rex


On Sat, Sep 12, 2015 at 12:05 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> You might want to check out https://github.com/lensacom/sparkit-learn
> <https://github.com/lensacom/sparkit-learn/blob/master/README.rst>
>
> Though it's true that for random forests / trees you will need to use
> MLlib.
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I fear you have to do all the plumbing yourself. This is the same for all
>> commercial and non-commercial libraries/analytics packages. How you
>> distribute often also depends on the functional requirements.
>>
>> On Sat, Sep 12, 2015 at 8:18 PM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> What is the best way to migrate existing scikit-learn code to a PySpark
>>> cluster? Then we could bring together the full power of both scikit-learn
>>> and Spark to do scalable machine learning. (I know we have MLlib, but the
>>> existing code base is big, and some functions are not fully supported yet.)
>>>
>>> Currently I use Python's multiprocessing module to boost the speed, but
>>> this only works on a single node, and only while the data set is small.
>>>
>>> For many real cases, we may need to deal with gigabytes or even
>>> terabytes of data, with thousands of raw categorical attributes, which can
>>> lead to millions of discrete features using a 1-of-k representation.
>>>
>>> For these cases, one solution is to use distributed memory. That's why I
>>> am considering Spark. And Spark supports Python! With PySpark, we can
>>> import scikit-learn.
>>>
>>> But the question is how to make the scikit-learn code, a DecisionTree
>>> classifier for example, run in distributed mode, so that it benefits from
>>> the power of Spark?
>>>
>>>
>>> Best,
>>> Rex
>>>
>>
>
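
Regarding the original question of running an existing scikit-learn
DecisionTree classifier in distributed mode: one pattern that needs no extra
library is to broadcast the training data and let Spark run many independent
scikit-learn fits in parallel, for example a hyperparameter grid search. A
minimal sketch only (it parallelizes the model search, not a single fit, so
each worker must still be able to hold the data in memory; `sc` is assumed to
be an existing SparkContext):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

    # Ship the (small, toy) training data to every worker once.
    iris = load_iris()
    data_bc = sc.broadcast((iris.data, iris.target))

    # Independent parameter settings to evaluate in parallel.
    param_grid = [{'max_depth': d, 'min_samples_leaf': l}
                  for d in (2, 4, 8, None)
                  for l in (1, 5, 10)]

    def evaluate(params):
        # Runs on the workers: plain single-node scikit-learn per task.
        X, y = data_bc.value
        clf = DecisionTreeClassifier(**params)
        return params, cross_val_score(clf, X, y, cv=5).mean()

    results = sc.parallelize(param_grid, len(param_grid)).map(evaluate).collect()
    best_params, best_score = max(results, key=lambda r: r[1])
    print(best_params, best_score)

For data that is genuinely too large for one machine, this pattern does not
help; that is where MLlib's distributed tree implementation, or the plumbing
Jörn mentions, becomes necessary.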
