PySpark MLlib Numpy Dependency

2015-07-28 Thread Eskilson,Aleksander
The documentation for the NumPy dependency for MLlib seems somewhat vague [1]. Is NumPy only a dependency for the driver node, or must it also be installed on every worker node? Thanks, Alek [1] -- http://spark.apache.org/docs/latest/mllib-guide.html#dependencies

Re: Broadcast variables in R

2015-07-20 Thread Eskilson,Aleksander
Hi Serge, The broadcast function was made private when SparkR merged into Apache Spark for the 1.4.0 release. You can still use broadcast by specifying the private namespace, though: SparkR:::broadcast(sc, obj). The RDD methods were considered very low-level, and the SparkR devs are still
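
A minimal sketch of that workaround, assuming a Spark 1.4 SparkR shell where sc already exists (e.g. from sparkR.init()); the lookup table is hypothetical, and value() is reached through the same private namespace:

    # Ship a small lookup table to every worker via the private API
    lookup <- c(a = 1, b = 2, c = 3)
    bcast <- SparkR:::broadcast(sc, lookup)

    # Read the broadcast value inside a worker-side function
    rdd <- SparkR:::parallelize(sc, c("a", "b", "c"))
    SparkR:::collect(SparkR:::lapply(rdd, function(x) {
      SparkR:::value(bcast)[[x]]
    }))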

Re: User Defined Functions - Execution on Clusters

2015-07-07 Thread Eskilson,Aleksander
that, they will still be much slower than Scala ones (because Python is slower and there is overhead in calling Python). On Mon, Jul 6, 2015 at 12:55 PM, Eskilson,Aleksander alek.eskil...@cerner.com wrote: Hi there, I’m trying to get a feel for how User Defined Functions from SparkSQL (as written

User Defined Functions - Execution on Clusters

2015-07-06 Thread Eskilson,Aleksander
Hi there, I’m trying to get a feel for how User Defined Functions from SparkSQL (as written in Python and registered using the udf function from pyspark.sql.functions) are run behind the scenes. Trying to grok the source, it seems that the native Python function is serialized for distribution

Re: sparkR could not find function textFile

2015-06-26 Thread Eskilson,Aleksander
with a file with all columns as String, but the real data I want to process are all doubles. I'm just exploring what SparkR can do versus regular Scala Spark, as I am at heart an R person. 2015-06-25 14:26 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com: Sure, I

Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Eskilson,Aleksander
Hi there, Parallelize is part of the RDD API, which was made private for Spark v. 1.4.0. Some functions in the RDD API were considered too low-level to expose, so currently only the DataFrame API is public. The original rationale for this decision can be found on the issue's JIRA [1]. The
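
A minimal sketch of the same trick, assuming a Spark 1.4.x SparkR shell with an existing sc; the ::: operator reaches the unexported function:

    # parallelize() is still present in the package, just not exported
    rdd <- SparkR:::parallelize(sc, 1:100, 2)  # 2 partitions
    SparkR:::count(rdd)                        # 100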

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Eskilson,Aleksander
The simple answer is that SparkR does support map/reduce operations over RDDs through the RDD API, but as of Spark v. 1.4.0, those functions were made private in SparkR. They can still be accessed by prepending the function with the namespace, like SparkR:::lapply(rdd, func). It was thought
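
A minimal sketch of a map/reduce pass through the private namespace, assuming a Spark 1.4 SparkR shell with an existing sc:

    nums <- SparkR:::parallelize(sc, 1:10)
    # Map each element, then reduce the mapped results
    squares <- SparkR:::lapply(nums, function(x) { x * x })
    SparkR:::reduce(squares, function(a, b) { a + b })  # 385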

Re: sparkR could not find function textFile

2015-06-25 Thread Eskilson,Aleksander
Hi there, The tutorial you’re reading there was written before the merge of SparkR for Spark 1.4.0. For the merge, the RDD API (which includes the textFile() function) was made private, as the devs felt many of its functions were too low-level. They focused instead on finishing the DataFrame
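
A minimal sketch of both routes, assuming a Spark 1.4 shell where sc and sqlContext already exist and the input paths are hypothetical; read.df is the public DataFrame reader the devs point to instead:

    # Private RDD route, as in the pre-merge tutorials
    lines <- SparkR:::textFile(sc, "hdfs:///tmp/corpus.txt")

    # Public DataFrame route favored since the 1.4.0 merge
    df <- read.df(sqlContext, "hdfs:///tmp/people.json", "json")
    head(df)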

Re: sparkR could not find function textFile

2015-06-25 Thread Eskilson,Aleksander
, it is very helpful. Cheers, Wei 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com: Hi there, The tutorial you’re reading there was written before the merge of SparkR for Spark 1.4.0. For the merge, the RDD API (which includes the textFile() function

Re: sparkR could not find function textFile

2015-06-25 Thread Eskilson,Aleksander
wondering what I did wrong. Thanks in advance. Wei 2015-06-25 13:44 GMT-07:00 Wei Zhou zhweisop...@gmail.com: Hi Alek, Thanks for the explanation, it is very helpful. Cheers, Wei 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com

Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Eskilson,Aleksander
--- From: Eskilson,Aleksander alek.eskil...@cerner.com Sent: June 25, 2015 5:57 AM To: Felix C felixcheun...@hotmail.com, user@spark.apache.org Subject: Re: SparkR parallelize not found with 1.4.1? Hi there, Parallelize is part of the RDD API which was made private for Spark v. 1.4.0. Some

Re: SparkR Jobs Hanging in collectPartitions

2015-05-29 Thread Eskilson,Aleksander
memory, but it's hard to say without more diagnostic information. Thanks Shivaram On Tue, May 26, 2015 at 7:28 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote: I’ve been attempting to run a SparkR translation of a similar Scala job that identifies words from

SparkR Jobs Hanging in collectPartitions

2015-05-26 Thread Eskilson,Aleksander
I’ve been attempting to run a SparkR translation of a similar Scala job that identifies words from a corpus not existing in a newline-delimited dictionary. The R code is: dict <- SparkR:::textFile(sc, src1) corpus <- SparkR:::textFile(sc, src2) words <- distinct(SparkR:::flatMap(corpus,