Re: SparkR and RDDs

Shivaram Venkataraman Wed, 27 May 2015 18:27:03 -0700

Sorry for the delay in getting back on this. The RDD interface is
private in the 1.4 release, but as Alek mentioned you can still use it by
prefixing the function names with `SparkR:::`.
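
A minimal sketch of what that looks like, assuming a local Spark 1.4
install with `SPARK_HOME` set. The RDD function names used here
(`parallelize`, `lapply`, `collect`) are assumptions about the 1.4
internals; because they are private, they may change without notice.

```r
library(SparkR)

# Start a local SparkContext (assumes SPARK_HOME points at a Spark 1.4 build).
sc <- sparkR.init(master = "local")

# parallelize() is not exported in 1.4, so reach it through the namespace.
rdd <- SparkR:::parallelize(sc, 1:5, 2L)

# The RDD method for lapply() is private as well; double each element
# on the workers.
doubled <- SparkR:::lapply(rdd, function(x) x * 2)

# Bring the results back to the driver as an R list.
result <- SparkR:::collect(doubled)

sparkR.stop()
```

Only the calls that touch the RDD API need the prefix; exported functions
such as `sparkR.init` can be called as usual.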


Regarding design direction -- there are two JIRAs which cover major
features we plan to work on for 1.5. SPARK-6805 tracks porting high-level
machine learning operations like `glm` and `kmeans` to SparkR using the ML
Pipeline implementation in Scala as the backend.

We are also planning to develop a parallel API where users can run native R
functions in a distributed setting; SPARK-7264 tracks this effort. If
you have specific use cases, feel free to chime in on the JIRA or on the dev
mailing list.

Thanks
Shivaram

On Tue, May 26, 2015 at 11:40 AM, Reynold Xin <[email protected]> wrote:

> You definitely don't want to implement kmeans in R, since it would be very
> slow. Just providing R wrappers for the MLlib implementation is the way to
> go. I believe one of the major items in SparkR next is the MLlib wrappers.
>
>
>
> On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis <[email protected]>
> wrote:
>
>> Hi Alek,
>> Thanks for the info. You are correct that using the three colons does
>> work. Admittedly I am an R novice, but since the three colons are used to
>> access hidden methods, it seems pretty dirty.
>>
>> Can someone shed light on the design direction being taken with SparkR?
>> Should I really be accessing hidden methods, or will a better approach
>> prevail? For instance, it feels like the k-means sample should really use
>> MLlib and not just be a port of the k-means sample using hidden methods. Am I
>> looking at this incorrectly?
>>
>> Thanks,
>> Andrew
>>
>> On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander <
>> [email protected]> wrote:
>>
>>>  From the changes to the namespace file, that appears to be correct:
>>> all methods of the RDD API have been made private, which in R means that
>>> you may still access them by using the namespace prefix SparkR with three
>>> colons, e.g. `SparkR:::func(foo, bar)`.
>>>
>>>  So a starting place for porting old SparkR scripts from before the
>>> merge could be to identify those methods in the script belonging to the RDD
>>> class and be sure they have the namespace identifier tacked on the front. I
>>> hope that helps.
>>>
>>>  Regards,
>>> Alek Eskilson
>>>
>>>   From: Andrew Psaltis <[email protected]>
>>> Date: Monday, May 25, 2015 at 6:25 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: SparkR and RDDs
>>>
>>>   Hi,
>>> I understand from SPARK-6799 [1] and the respective merge commit [2]
>>> that the RDD class is private in Spark 1.4. If I wanted to modify the old
>>> Kmeans and/or LR examples so that the computation happened in Spark, what is
>>> the best direction to go? Sorry if I am missing something obvious, but
>>> based on the NAMESPACE file [3] in the SparkR codebase I am having trouble
>>> seeing the obvious direction to go.
>>>
>>>  Thanks in advance,
>>> Andrew
>>>
>>>  [1] https://issues.apache.org/jira/browse/SPARK-6799
>>> [2]
>>> https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
>>> [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE
>>>
>>>
>>
>>
>
