Re: SparkR and RDDs

2015-05-27 Thread Andrew Psaltis
Hi Shivaram,
Thanks for the details; they are greatly appreciated.

Thanks

On Wed, May 27, 2015 at 7:25 PM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 Sorry for the delay in getting back on this. The RDD interface is
 private in the 1.4 release but, as Alek mentioned, you can still use it by
 prefixing calls with `SparkR:::`.

 Regarding design direction -- there are two JIRAs which cover major
 features we plan to work on for 1.5. SPARK-6805 tracks porting high-level
 machine learning operations like `glm` and `kmeans` to SparkR using the ML
 Pipeline implementation in Scala as the backend.

 We are also planning to develop a parallel API where users can run native
 R functions in a distributed setting; SPARK-7264 tracks this effort. If
 you have specific use cases, feel free to chime in on the JIRA or on the dev
 mailing list.

 Thanks
 Shivaram

 On Tue, May 26, 2015 at 11:40 AM, Reynold Xin r...@databricks.com wrote:

 You definitely don't want to implement kmeans in R, since it would be
 very slow. Just providing R wrappers for the MLlib implementation is the
 way to go. I believe one of the major items coming next in SparkR is the
 MLlib wrappers.



 On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis psaltis.and...@gmail.com
  wrote:

 Hi Alek,
 Thanks for the info. You are correct that using the three colons does
 work. Admittedly I am an R novice, but since the three colons are used to
 access hidden methods, it seems pretty dirty.

 Can someone shed light on the design direction being taken with SparkR?
 Should I really be accessing hidden methods, or will a better approach
 prevail? For instance, it feels like the k-means sample should really use
 MLlib and not just be a port of the k-means sample using hidden methods. Am I
 looking at this incorrectly?

 Thanks,
 Andrew

 On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander 
 alek.eskil...@cerner.com wrote:

  From the changes to the namespace file, that appears to be correct:
 all methods of the RDD API have been made private, which in R means that
 you may still access them by using the namespace prefix SparkR with three
 colons, e.g. `SparkR:::func(foo, bar)`.

  So a starting place for porting old SparkR scripts from before the
 merge could be to identify those methods in the script belonging to the RDD
 class and be sure they have the namespace identifier tacked on the front. I
 hope that helps.

  Regards,
 Alek Eskilson

   From: Andrew Psaltis psaltis.and...@gmail.com
 Date: Monday, May 25, 2015 at 6:25 PM
 To: dev@spark.apache.org dev@spark.apache.org
 Subject: SparkR and RDDs

   Hi,
 I understand from SPARK-6799 [1] and the respective merge commit [2]
 that the RDD class is private in Spark 1.4. If I wanted to modify the old
 Kmeans and/or LR examples so that the computation happened in Spark, what is
 the best direction to go? Sorry if I am missing something obvious, but
 based on the NAMESPACE file [3] in the SparkR codebase I am having trouble
 seeing the obvious direction to go.

  Thanks in advance,
 Andrew

  [1] https://issues.apache.org/jira/browse/SPARK-6799
 [2] https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
 [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE

CONFIDENTIALITY NOTICE This message and any included attachments
 are from Cerner Corporation and are intended only for the addressee. The
 information contained in this message is confidential and may constitute
 inside or non-public information under international, federal, or state
 securities laws. Unauthorized forwarding, printing, copying, distribution,
 or use of such information is strictly prohibited and may be unlawful. If
 you are not the addressee, please promptly delete this message and notify
 the sender of the delivery error by e-mail or you may call Cerner's
 corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.







Re: SparkR and RDDs

2015-05-26 Thread Eskilson,Aleksander
From the changes to the namespace file, that appears to be correct: all
methods of the RDD API have been made private, which in R means that you may
still access them by using the namespace prefix SparkR with three colons, e.g.
`SparkR:::func(foo, bar)`.

So a starting place for porting old SparkR scripts from before the merge could 
be to identify those methods in the script belonging to the RDD class and be 
sure they have the namespace identifier tacked on the front. I hope that helps.
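
The namespace mechanics described above can be sketched with base R alone. This is a minimal illustration, not SparkR code: it assumes only the `stats` package that ships with R, and uses its non-exported S3 method `t.test.default` as a stand-in for a private RDD function.

```r
# `::` resolves only names that a package exports.
exported <- stats::t.test

# `t.test.default` lives in the stats namespace but is not exported,
# so `::` signals an error and the handler returns FALSE.
two_colons_ok <- tryCatch({
  stats::t.test.default
  TRUE
}, error = function(e) FALSE)

# `:::` looks inside the namespace itself and finds the function.
hidden <- stats:::t.test.default

print(is.function(exported))
print(two_colons_ok)
print(is.function(hidden))
```

In a SparkR 1.4 script the same pattern takes the form `SparkR:::func(foo, bar)`, where `func` is a placeholder for whichever RDD method the old script calls.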

Regards,
Alek Eskilson



Re: SparkR and RDDs

2015-05-26 Thread Reynold Xin
You definitely don't want to implement kmeans in R, since it would be very
slow. Just providing R wrappers for the MLlib implementation is the way to
go. I believe one of the major items coming next in SparkR is the MLlib
wrappers.








Re: SparkR and RDDs

2015-05-26 Thread Andrew Psaltis
Hi Alek,
Thanks for the info. You are correct that using the three colons does
work. Admittedly I am an R novice, but since the three colons are used to
access hidden methods, it seems pretty dirty.

Can someone shed light on the design direction being taken with SparkR?
Should I really be accessing hidden methods, or will a better approach
prevail? For instance, it feels like the k-means sample should really use
MLlib and not just be a port of the k-means sample using hidden methods. Am I
looking at this incorrectly?

Thanks,
Andrew
