I should point out that I'm not sure what the performance of that project is.

I'd expect that the native DataFrame in PySpark will be significantly more
efficient than their DictRDD.
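For concreteness, here is roughly what the two representations look like side
by side. This is only a sketch based on my reading of the sparkit-learn README,
with made-up data, assuming the usual sc / sqlContext globals from the PySpark
shell:

import numpy as np
from splearn.rdd import DictRDD

# sparkit-learn wraps a pair of RDDs into a DictRDD of numpy blocks
X_rdd = sc.parallelize([[0.0, 1.0], [1.0, 0.0]], 2)
y_rdd = sc.parallelize([0, 1], 2)
Z = DictRDD((X_rdd, y_rdd), columns=('X', 'y'),
            dtype=[np.ndarray, np.ndarray])

# versus a native DataFrame, which goes through Spark SQL's
# optimized execution path
df = sqlContext.createDataFrame([([0.0, 1.0], 0), ([1.0, 0.0], 1)],
                                ["features", "label"])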

It would be interesting to see a performance comparison of sparkit-learn's
pipelines relative to native Spark ML pipelines, if you do test both out.
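If you do, a natural baseline on the native side would be something like this
sketch (DataFrame-based spark.ml API, toy data):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

df = sqlContext.createDataFrame(
    [("spark is fast", 1.0), ("slow mail client", 0.0)],
    ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# every stage runs over DataFrames, so the whole pipeline stays native
model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(df)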

—
Sent from Mailbox

On Sat, Sep 12, 2015 at 10:52 PM, Rex X <dnsr...@gmail.com> wrote:

> Jörn and Nick,
> Thanks for answering.
> Nick, the sparkit-learn project looks interesting. Thanks for mentioning it.
> Rex
> On Sat, Sep 12, 2015 at 12:05 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>> You might want to check out https://github.com/lensacom/sparkit-learn
>>
>> Though it's true that for random forests / trees you will need to use MLlib.
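>> For the tree case, a rough sketch of the MLlib route (RDD-based API, toy
>> data just for illustration):
>>
>> from pyspark.mllib.tree import RandomForest
>> from pyspark.mllib.regression import LabeledPoint
>>
>> data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
>>                        LabeledPoint(1.0, [1.0, 0.0])])
>> model = RandomForest.trainClassifier(
>>     data, numClasses=2, categoricalFeaturesInfo={}, numTrees=10)
>> predictions = model.predict(data.map(lambda lp: lp.features))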
>>
>> —
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>>
>> On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> I fear you have to do the plumbing all yourself. This is the same for all
>>> commercial and non-commercial libraries/analytics packages. How you
>>> distribute often also depends on the functional requirements.
>>>
>>> On Sat, Sep 12, 2015 at 8:18 PM, Rex X <dnsr...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> What is the best way to migrate existing scikit-learn code to a PySpark
>>>> cluster? Then we could bring together the full power of both scikit-learn
>>>> and Spark to do scalable machine learning. (I know we have MLlib, but the
>>>> existing code base is big, and some functions are not fully supported yet.)
>>>>
>>>> Currently I use Python's multiprocessing module to boost the speed, but
>>>> this only works on one node, and only while the data set is small.
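>>>> A simplified sketch of what I do now (the real code is more involved,
>>>> and X, y here are just random stand-ins):
>>>>
>>>> import numpy as np
>>>> from multiprocessing import Pool
>>>> from sklearn.cross_validation import cross_val_score
>>>> from sklearn.tree import DecisionTreeClassifier
>>>>
>>>> X = np.random.rand(1000, 20)
>>>> y = np.random.randint(2, size=1000)
>>>>
>>>> def evaluate(max_depth):
>>>>     clf = DecisionTreeClassifier(max_depth=max_depth)
>>>>     return cross_val_score(clf, X, y).mean()
>>>>
>>>> # parallel across cores, but everything stays on a single machine
>>>> scores = Pool(processes=8).map(evaluate, range(2, 20))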
>>>>
>>>> For many real cases, we may need to deal with gigabytes or even
>>>> terabytes of data, with thousands of raw categorical attributes, which can
>>>> lead to millions of discrete features under a 1-of-k (one-hot)
>>>> representation.
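>>>> To illustrate the blow-up with two made-up attributes:
>>>>
>>>> from sklearn.feature_extraction import DictVectorizer
>>>>
>>>> rows = [{'city': 'SF', 'device': 'mobile'},
>>>>         {'city': 'NYC', 'device': 'desktop'}]
>>>> X_sparse = DictVectorizer().fit_transform(rows)  # sparse 1-of-k matrix
>>>> # thousands of attributes x many levels each quickly
>>>> # pushes the column count into the millions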
>>>>
>>>> For these cases, one solution is to use distributed memory. That's why I
>>>> am considering Spark. And Spark supports Python! With PySpark, we can
>>>> import scikit-learn.
>>>>
>>>> But the question is how to make the scikit-learn code, a decision tree
>>>> classifier for example, run in distributed mode, to benefit from the
>>>> power of Spark?
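>>>> The only pattern I can see so far is to broadcast the data and
>>>> parallelize over parameters (a sketch, reusing the toy X, y from above),
>>>> but that still requires the full data set to fit on each worker:
>>>>
>>>> from sklearn.tree import DecisionTreeClassifier
>>>>
>>>> bc = sc.broadcast((X, y))
>>>>
>>>> def fit_tree(max_depth):
>>>>     X_b, y_b = bc.value
>>>>     clf = DecisionTreeClassifier(max_depth=max_depth)
>>>>     return max_depth, clf.fit(X_b, y_b).score(X_b, y_b)
>>>>
>>>> results = sc.parallelize(range(2, 20)).map(fit_tree).collect()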
>>>>
>>>>
>>>> Best,
>>>> Rex
>>>>
>>>
>>
