Re: SparkR and RDDs
Hi Shivaram,

Thanks for the details, it is greatly appreciated.

Thanks

On Wed, May 27, 2015 at 7:25 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

Sorry for the delay in getting back on this. So the RDD interface is private in the 1.4 release but as Alek mentioned you can still use it by prefixing `SparkR:::`.

Regarding design direction -- there are two JIRAs which cover major features we plan to work on for 1.5. SPARK-6805 tracks porting high-level machine learning operations like `glm` and `kmeans` to SparkR using the ML Pipeline implementation in Scala as the backend. We are also planning to develop a parallel API where users can run native R functions in a distributed setting and SPARK-7264 tracks this effort.

If you have specific use cases feel free to chime in on the JIRA or on the dev mailing list.

Thanks
Shivaram

On Tue, May 26, 2015 at 11:40 AM, Reynold Xin r...@databricks.com wrote:

You definitely don't want to implement kmeans in R, since it would be very slow. Just providing R wrappers for the MLlib implementation is the way to go. I believe one of the major items in SparkR next is the MLlib wrappers.

On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis psaltis.and...@gmail.com wrote:

Hi Alek,

Thanks for the info. You are correct that using the three colons does work. Admittedly I am an R novice, but since the three colons are used to access hidden methods, it seems pretty dirty. Can someone shed light on the design direction being taken with SparkR? Should I really be accessing hidden methods, or will a better approach prevail? For instance, it feels like the k-means sample should really use MLlib and not just be a port of the k-means sample using hidden methods. Am I looking at this incorrectly?

Thanks,
Andrew

On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote:

From the changes to the namespace file, that appears to be correct: all methods of the RDD API have been made private, which in R means that you may still access them by using the namespace prefix SparkR with three colons, e.g. SparkR:::func(foo, bar). So a starting place for porting old SparkR scripts from before the merge could be to identify those methods in the script belonging to the RDD class and be sure they have the namespace identifier tacked on the front.

I hope that helps.

Regards,
Alek Eskilson

From: Andrew Psaltis psaltis.and...@gmail.com
Date: Monday, May 25, 2015 at 6:25 PM
To: dev@spark.apache.org
Subject: SparkR and RDDs

Hi,

I understand from SPARK-6799 [1] and the respective merge commit [2] that the RDD class is private in Spark 1.4. If I wanted to modify the old Kmeans and/or LR examples so that the computation happened in Spark, what is the best direction to go? Sorry if I am missing something obvious, but based on the NAMESPACE file [3] in the SparkR codebase I am having trouble seeing the obvious direction to go.

Thanks in advance,
Andrew

[1] https://issues.apache.org/jira/browse/SPARK-6799
[2] https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
[3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE

CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.
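
For readers less familiar with the three-colon prefix discussed in this thread, here is a minimal sketch in R. The first part uses only base R to contrast `::` (exported objects only) with `:::` (any object in a package namespace), using `t.test.default` from the stats package as a stand-in for a private method. The second part defines, but never calls, a hypothetical helper named `double_with_rdd`. It assumes SparkR 1.4 is installed, that `sc` comes from `sparkR.init()`, and that the RDD functions `parallelize`, `lapply`, and `collect` keep their pre-merge names and argument order. None of that is confirmed in the thread beyond the `SparkR:::` prefix itself.

```r
# Part 1: base R only. `t.test.default` lives in the stats namespace but is
# not exported, much like the RDD methods in SparkR 1.4.
exported <- "t.test.default" %in% getNamespaceExports("stats")
in_namespace <- exists("t.test.default",
                       envir = asNamespace("stats"), inherits = FALSE)

# `::` only returns exported objects, so this lookup raises an error;
# tryCatch turns the error into NULL instead of stopping the script.
via_double <- tryCatch(stats::t.test.default, error = function(e) NULL)

# `:::` reaches into the namespace and returns the unexported function.
via_triple <- stats:::t.test.default

print(c(exported = exported, in_namespace = in_namespace))
print(is.null(via_double))
print(is.function(via_triple))

# Part 2: hypothetical helper, defined but never called here.
# ASSUMPTIONS: SparkR 1.4 is installed, `sc` was created by sparkR.init(),
# and the private RDD functions have the names and argument order used below.
double_with_rdd <- function(sc, xs) {
  rdd <- SparkR:::parallelize(sc, xs, 2L)             # distribute xs over 2 slices
  doubled <- SparkR:::lapply(rdd, function(x) x * 2)  # apply an R function per element
  SparkR:::collect(doubled)                           # bring results back as a list
}
```

Because R resolves `SparkR:::` only when the helper is called, the script runs without SparkR installed. As Shivaram and Reynold note, this prefix is a stopgap for porting old scripts, and the MLlib wrappers planned for 1.5 are the intended route for algorithms such as k-means.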