You might want to check out https://github.com/lensacom/sparkit-learn

Though it's true that for random forests / trees you will need to use MLlib.
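
For reference, here is a minimal sketch of training a random forest with
MLlib's RDD-based Python API (a sketch only; it assumes a running
SparkContext named sc and purely numeric features):

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    # Toy data; each record is a LabeledPoint(label, feature_vector).
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
    ])

    # categoricalFeaturesInfo is empty because all features here are
    # treated as continuous.
    model = RandomForest.trainClassifier(
        data, numClasses=2, categoricalFeaturesInfo={},
        numTrees=10, seed=42)

    print(model.predict([1.0, 0.0]))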



—
Sent from Mailbox

On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> I fear you have to do the plumbing all yourself. This is the same for all
> commercial and non-commercial libraries/analytics packages. How you
> distribute often also depends on the functional requirements.
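>
> A typical example of such plumbing (just a sketch, assuming the training
> data fits in memory on every worker and that sc is the SparkContext) is
> to distribute independent scikit-learn fits as Spark tasks, e.g. for a
> parameter search:
>
>     from sklearn.datasets import make_classification
>     from sklearn.tree import DecisionTreeClassifier
>
>     # Hypothetical small dataset; in practice broadcast your own arrays.
>     X, y = make_classification(n_samples=1000, n_features=20)
>     bX, by = sc.broadcast(X), sc.broadcast(y)
>
>     def fit_one(max_depth):
>         # Each Spark task fits a full, independent scikit-learn model.
>         clf = DecisionTreeClassifier(max_depth=max_depth)
>         clf.fit(bX.value, by.value)
>         return max_depth, clf.score(bX.value, by.value)
>
>     # Distribute the parameter grid, not the data.
>     results = sc.parallelize([2, 4, 8, 16]).map(fit_one).collect()
>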
> On Sat, Sep 12, 2015 at 20:18, Rex X <dnsr...@gmail.com> wrote:
>> Hi everyone,
>>
>> What is the best way to migrate existing scikit-learn code to a PySpark
>> cluster? Then we could bring together the full power of both scikit-learn
>> and Spark to do scalable machine learning. (I know we have MLlib, but the
>> existing code base is big, and some functions are not fully supported yet.)
>>
>> Currently I use Python's multiprocessing module to boost the speed, but
>> this only works on a single node, and only while the data set is small.
>>
>> For many real cases, we may need to deal with gigabytes or even terabytes
>> of data, with thousands of raw categorical attributes, which can lead to
>> millions of discrete features under a 1-of-k representation.
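>>
>> For concreteness, a toy sketch of that 1-of-k blow-up with scikit-learn's
>> DictVectorizer (the attribute names here are made up):
>>
>>     from sklearn.feature_extraction import DictVectorizer
>>
>>     rows = [{'city': 'NY', 'browser': 'chrome'},
>>             {'city': 'SF', 'browser': 'firefox'}]
>>
>>     # sparse=True keeps the resulting 1-of-k columns as a scipy CSR
>>     # matrix instead of a dense array, which matters once you reach
>>     # millions of features.
>>     vec = DictVectorizer(sparse=True)
>>     X = vec.fit_transform(rows)   # one binary column per (attr, value)
>>     print(vec.get_feature_names())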
>>
>> For these cases, one solution is to use distributed memory. That's why
>> I am considering Spark. And Spark supports Python!
>> With PySpark, we can import scikit-learn.
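>>
>> For instance, one common pattern (a sketch; X_train, y_train, and
>> feature_rdd are placeholders, and sc is the SparkContext) is to fit on
>> the driver and broadcast the fitted estimator for distributed scoring:
>>
>>     from sklearn.tree import DecisionTreeClassifier
>>
>>     clf = DecisionTreeClassifier().fit(X_train, y_train)  # on the driver
>>     bc = sc.broadcast(clf)                                # ship to workers
>>
>>     # Each partition calls plain scikit-learn predict() on its rows.
>>     preds = feature_rdd.mapPartitions(
>>         lambda rows: bc.value.predict(list(rows))).collect()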
>>
>> But the question is: how do we make the scikit-learn code, a
>> DecisionTree classifier for example, run in distributed mode, so as to
>> benefit from the power of Spark?
>>
>>
>> Best,
>> Rex
>>
