Re: Broadcast variables in R

2015-07-20 Thread Eskilson,Aleksander
Hi Serge,

The broadcast function was made private when SparkR merged into Apache
Spark for the 1.4.0 release. You can still use broadcast by specifying the
private namespace though.

SparkR:::broadcast(sc, obj)
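For a fuller picture, here is a minimal sketch of a broadcast round trip through
the private namespace (this assumes sc is your SparkContext and that the other
RDD helpers -- parallelize, lapply, collect, and value -- are still present, if
private, in your build; the lookup table is just an illustration):

# Broadcast a lookup table once, then reference it inside a worker function.
lookup <- list(a = 1, b = 2, c = 3)
bc <- SparkR:::broadcast(sc, lookup)

rdd <- SparkR:::parallelize(sc, c("a", "b", "c"), 2L)
mapped <- SparkR:::lapply(rdd, function(key) {
  # value() pulls the broadcast value back out on the worker
  SparkR:::value(bc)[[key]]
})
SparkR:::collect(mapped)   # list(1, 2, 3)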

The RDD methods were considered very low-level, and the SparkR devs are
still figuring out which of them they'd like to expose along with the
higher-level DataFrame API. You can see the rationale for the decision on
the project JIRA [1].

[1] -- https://issues.apache.org/jira/browse/SPARK-7230

Hope that helps,
Alek

On 7/20/15, 12:00 PM, "Serge Franchois"  wrote:

>I've searched high and low to use broadcast variables in R.
>Is it possible at all? I don't see them mentioned in the SparkR API.
>Or is there another way of using this feature?
>
>I need to share a large amount of data between executors.
>At the moment, I get warned about my task being too large.
>
>I have tried pyspark, and there I can use them.
>
>Wkr,
>
>Serge





PySpark MLlib Numpy Dependency

2015-07-28 Thread Eskilson,Aleksander
The documentation for the NumPy dependency for MLlib seems somewhat vague [1].
Is NumPy only a dependency for the driver node, or must it also be installed on
every worker node?

Thanks,
Alek

[1] -- http://spark.apache.org/docs/latest/mllib-guide.html#dependencies



SparkR Jobs Hanging in collectPartitions

2015-05-26 Thread Eskilson,Aleksander
I’ve been attempting to run a SparkR translation of a Scala job that identifies
words from a corpus that do not appear in a newline-delimited dictionary.
The R code is:

dict <- SparkR:::textFile(sc, src1)
corpus <- SparkR:::textFile(sc, src2)
words <- distinct(SparkR:::flatMap(corpus, function(line) {
  gsub("[[:punct:]]", "", tolower(strsplit(line, " |,|-")[[1]]))
}))
found <- subtract(words, dict)

(where src1, src2 are locations on HDFS)

Then attempting something like take(found, 10) or saveAsTextFile(found, dest) 
should realize the collection, but that stage of the DAG hangs in Scheduler 
Delay during the collectPartitions phase.

The analogous Scala code, however,
val corpus = sc.textFile(src1).flatMap(_.split(" |,|-"))
val dict = sc.textFile(src2)
val words = corpus.map(word => word.filter(Character.isLetter(_))).distinct()
val found = words.subtract(dict)

performs as expected. Any thoughts?

Thanks,
Alek Eskilson



Re: SparkR Jobs Hanging in collectPartitions

2015-05-29 Thread Eskilson,Aleksander
Sure. Looking more closely at the code, I thought I might have had an error in
the flow of data structures in the R code. The line that extracts the words
from the corpus is now:
words <- distinct(SparkR:::flatMap(corpus, function(line) {
  strsplit(
    gsub("^\\s+|[[:punct:]]", "", tolower(line)),
    "\\s")[[1]]
}))
(just removes leading whitespace and all punctuation after having made the 
whole line lowercase, then splits to a vector of words, ultimately flattening 
the whole collection)

count() works on the resultant words RDD, returning the value expected, so the
hang most likely occurs during the subtract. I should mention that the size of
the corpus is very small, just kilobytes in size. The dictionary I subtract
against is also quite modest by Spark standards, just 4.8 MB, and I’ve got 2 GB
of memory for the worker, which ought to be sufficient for such a small job.
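
As a sketch of how I’ve been bisecting it, counting after each stage (these
calls may need the SparkR::: prefix depending on the build, and assume the same
sc, corpus, words, and dict bindings as above):

count(corpus)                   # raw textFile read completes
count(words)                    # flatMap + distinct completes
count(subtract(words, dict))    # only this step appears to hang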

The Scala analog runs quite fast, even with the subtract. If we look at the DAG
for the SparkR job and compare that against the event timeline for Stage 3, it
seems the job is stuck in Scheduler Delay (with 0/2 tasks completed) and never
begins the rest of the stage. Unfortunately, the executor log hangs up as well
and doesn’t give much info.
[attached screenshot: Stage 3 event timeline, showing tasks stuck in Scheduler Delay]

Could you describe in a little more detail at what points data is actually held 
in R’s internal process memory? I was under the impression that 
SparkR:::textFile created an RDD object that would only be realized when a DAG 
requiring it was executed, and would therefore be part of the memory managed by 
Spark, and that memory would only be moved to R as an R object following a 
collect(), take(), etc.

Thanks,
Alek Eskilson
From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
Reply-To: shiva...@eecs.berkeley.edu
Date: Wednesday, May 27, 2015 at 8:26 PM
To: Aleksander Eskilson <alek.eskil...@cerner.com>
Cc: user@spark.apache.org
Subject: Re: SparkR Jobs Hanging in collectPartitions

Could you try to see which phase is causing the hang? i.e., if you do a count()
after the flatMap, does that work correctly? My guess is that the hang is
somehow related to data not fitting in the R process memory, but it's hard to
say without more diagnostic information.

Thanks
Shivaram




Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Eskilson,Aleksander
Hi there,

Parallelize is part of the RDD API, which was made private for Spark v1.4.0.
Some functions in the RDD API were considered too low-level to expose, so for
now only the DataFrame API is public. The original rationale for this decision
can be found on the issue's JIRA [1]. The devs are still considering which
parts of the RDD API, if any, should be made public in later releases. If you
have a use case that you feel is most easily addressed by the functions
currently private in the RDD API, go ahead and let the dev mailing list know.
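
As a minimal sketch of that workaround (assuming sc is your SparkContext and
that these RDD functions are still present, if private, in your 1.4.x build):

rdd <- SparkR:::parallelize(sc, 1:10, 2L)   # private RDD API
SparkR:::collect(rdd)                       # list(1, 2, ..., 10)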

Alek
[1] -- https://issues.apache.org/jira/browse/SPARK-7230

On 6/25/15, 12:24 AM, "Felix C"  wrote:

>Hi,
>
>It must be something very straightforward...
>
>Not working:
>parallelize(sc)
>Error: could not find function "parallelize"
>
>Working:
>df <- createDataFrame(sqlContext, localDF)
>
>What did I miss?
>Thanks




Re: How to Map and Reduce in sparkR

2015-06-25 Thread Eskilson,Aleksander
The simple answer is that SparkR does support map/reduce operations over RDDs
through the RDD API, but as of Spark v1.4.0 those functions were made private
in SparkR. They can still be accessed by prepending the function with the
namespace, like SparkR:::lapply(rdd, func). It was thought, though, that many
of the functions in the RDD API were too low-level to expose, with much more of
the focus going into the DataFrame API. The original rationale for this
decision can be found in its JIRA [1]. The devs are still deciding which
functions of the RDD API, if any, should be made public in future releases. If
you feel some use cases are most easily handled in SparkR through RDD
functions, go ahead and let the dev email list know.
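
As a minimal sketch of what that looks like (assuming sc is your SparkContext
and that these functions are still present, if private, in your build):

rdd <- SparkR:::parallelize(sc, 1:100, 2L)
squares <- SparkR:::lapply(rdd, function(x) { x * x })   # the map step
total <- SparkR:::reduce(squares, "+")                   # the reduce step, 338350
SparkR:::take(squares, 5L)                               # peek at a few mapped values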

Alek
[1] -- https://issues.apache.org/jira/browse/SPARK-7230

From: Wei Zhou <zhweisop...@gmail.com>
Date: Wednesday, June 24, 2015 at 4:59 PM
To: user@spark.apache.org
Subject: How to Map and Reduce in sparkR

Does anyone know whether sparkR supports map and reduce operations as RDD
transformations? Thanks in advance.

Best,
Wei



Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Eskilson,Aleksander
I forgot to mention that if you need to access these functions for some
reason, you can prepend the function call with the SparkR private
namespace, like so,
SparkR:::lapply(rdd, func).

On 6/25/15, 9:30 AM, "Felix C"  wrote:

>Thanks! It's good to know



Re: sparkR could not find function "textFile"

2015-06-25 Thread Eskilson,Aleksander
Hi there,

The tutorial you’re reading there was written before the merge of SparkR for
Spark 1.4.0. For the merge, the RDD API (which includes the textFile() function)
was made private, as the devs felt many of its functions were too low-level.
They focused instead on finishing the DataFrame API, which supports local,
HDFS, and Hive/HBase file reads. In the meantime, the devs are trying to
determine which functions of the RDD API, if any, should be made public again.
You can see the rationale behind this decision on the issue’s JIRA [1].

You can still make use of those now-private RDD functions by prepending the
function call with the SparkR private namespace; for example, you’d use
SparkR:::textFile(…).
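
For instance, the old word count example can still be pieced together against
the private API, roughly like this (a sketch assuming sc is your SparkContext
and that flatMap, lapply, reduceByKey, and take remain available, if private,
in your build; the naive split on spaces is just for illustration):

lines <- SparkR:::textFile(sc, "README.md")
words <- SparkR:::flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
pairs <- SparkR:::lapply(words, function(word) { list(word, 1L) })  # (word, 1) pairs
counts <- SparkR:::reduceByKey(pairs, "+", 2L)                      # sum per word
SparkR:::take(counts, 10L)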

Hope that helps,
Alek

[1] -- https://issues.apache.org/jira/browse/SPARK-7230

From: Wei Zhou <zhweisop...@gmail.com>
Date: Thursday, June 25, 2015 at 3:33 PM
To: user@spark.apache.org
Subject: sparkR could not find function "textFile"

Hi all,

I am exploring sparkR by activating the shell and following the tutorial here 
https://amplab-extras.github.io/SparkR-pkg/

And when I tried to read in a local file with textFile(sc, "file_location"),
it gave the error: could not find function "textFile".

By reading through the sparkR doc for 1.4, it seems that we need a sqlContext
to import data, for example:

people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
And we need to specify the file type.

My question is: has sparkR stopped supporting general file importing? If not, I
would appreciate any help on how to do this.

PS: I am trying to recreate the word count example in sparkR, and want to
import the README.md file, or just any file, into sparkR.

Thanks in advance.

Best,
Wei



Re: sparkR could not find function "textFile"

2015-06-25 Thread Eskilson,Aleksander
Yeah, that’s probably because the head() you’re invoking there is defined for
SparkR DataFrames [1] (note how you don’t have to use the SparkR::: namespace
in front of it), but SparkR:::textFile() returns an RDD object, which is more
like a distributed list data structure the way you’re applying it over that .md
text file. If you want to look at the first item or first several items in the
RDD, I think you want to use SparkR:::first() or SparkR:::take(), both of which
are applied to RDDs.
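
For example, a quick sketch against the private API, reusing the lines RDD from
your snippet:

lines <- SparkR:::textFile(sc, "./README.md")
SparkR:::first(lines)       # the first line as a character string
SparkR:::take(lines, 5L)    # a list of the first five lines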

Just remember that all the functions described in the public API [2] for SparkR
right now are related mostly to working with DataFrames. You’ll have to use the
R command-line doc or look at the RDD source code [3] for all the private
functions you might want (the source includes the doc strings used to make the
R doc), whichever you find easier.

Alek

[1] -- http://spark.apache.org/docs/latest/api/R/head.html
[2] -- https://spark.apache.org/docs/latest/api/R/index.html
[3] -- https://github.com/apache/spark/blob/master/R/pkg/R/RDD.R

From: Wei Zhou <zhweisop...@gmail.com>
Date: Thursday, June 25, 2015 at 3:49 PM
To: Aleksander Eskilson <alek.eskil...@cerner.com>
Cc: user@spark.apache.org
Subject: Re: sparkR could not find function "textFile"

Hi Alek,

Just a follow up question. This is what I did in sparkR shell:

lines <- SparkR:::textFile(sc, "./README.md")
head(lines)

And I am getting error:

"Error in x[seq_len(n)] : object of type 'S4' is not subsettable"

I'm wondering what I did wrong. Thanks in advance.

Wei

2015-06-25 13:44 GMT-07:00 Wei Zhou <zhweisop...@gmail.com>:
Hi Alek,

Thanks for the explanation, it is very helpful.

Cheers,
Wei





Re: sparkR could not find function "textFile"

2015-06-25 Thread Eskilson,Aleksander
Sure, I had a similar question that Shivaram was able to answer for me; the
solution is implemented using a separate Databricks library. Check out this
thread from the email archives [1], and the read.df() command [2]. CSV files
can be a bit tricky, especially with inferring their schemas. Are you using
just strings as your column types right now?
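
As a rough sketch of what that looks like (the package coordinates, options,
and path here are assumptions -- adjust the spark-csv version to your build):

# launch with: bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
df <- read.df(sqlContext, "hdfs:///path/to/data.csv",
              source = "com.databricks.spark.csv", header = "true")
printSchema(df)   # columns typically come back as strings unless a schema is given
head(df)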

Alek

[1] -- 
http://apache-spark-developers-list.1001551.n3.nabble.com/CSV-Support-in-SparkR-td12559.html
[2] -- https://spark.apache.org/docs/latest/api/R/read.df.html

From: Wei Zhou <zhweisop...@gmail.com>
Date: Thursday, June 25, 2015 at 4:15 PM
To: shiva...@eecs.berkeley.edu
Cc: Aleksander Eskilson <alek.eskil...@cerner.com>, user@spark.apache.org
Subject: Re: sparkR could not find function "textFile"

Thanks to both Shivaram and Alek. If I want to create a DataFrame from
comma-separated flat files, what would you recommend I do? One way I can think
of is to first read the data as you would in R, using read.table(), and then
create a Spark DataFrame out of that R data.frame, but that is obviously not
scalable.


2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman <shiva...@eecs.berkeley.edu>:
The `head` function is not supported for the RRDD that is returned by 
`textFile`. You can run `take(lines, 5L)`. I should add a warning here that the 
RDD API in SparkR is private because we might not support it in the upcoming 
releases. So if you can use the DataFrame API for your application you should 
try that out.

Thanks
Shivaram


Re: sparkR could not find function "textFile"

2015-06-26 Thread Eskilson,Aleksander
Yeah, I ask because you might notice that by default the column types for CSV
tables read in by read.df() are only strings (due to limitations in type
inference in the Databricks package). There was a separate discussion about
schema inference, and Shivaram recently merged support for specifying your own
schema as an argument to read.df(). The schema is defined as a structType. To
see how this schema is declared, check out Hossein Falaki’s response in this
thread [1].
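
A sketch of what that looks like (assuming a build recent enough to carry that
merge; the two-column CSV here is hypothetical):

schema <- structType(structField("name", "string"),
                     structField("age", "double"))
df <- read.df(sqlContext, "hdfs:///path/to/people.csv",
              source = "com.databricks.spark.csv", schema = schema, header = "true")
printSchema(df)   # age is now read as double rather than string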

— Alek

[1] -- 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-DataFrame-Column-Casts-esp-from-CSV-Files-td12589.html

From: Wei Zhou <zhweisop...@gmail.com>
Date: Thursday, June 25, 2015 at 4:38 PM
To: Aleksander Eskilson <alek.eskil...@cerner.com>
Cc: shiva...@eecs.berkeley.edu, user@spark.apache.org
Subject: Re: sparkR could not find function "textFile"

I tried out the solution using the spark-csv package, and it works fine now :)
Thanks. Yes, I'm playing with a file with all columns as String, but the real
data I want to process are all doubles. I'm just exploring what sparkR can do
versus regular Scala Spark, as I am at heart an R person.


User Defined Functions - Execution on Clusters

2015-07-06 Thread Eskilson,Aleksander
Hi there,

I’m trying to get a feel for how User Defined Functions from SparkSQL (as 
written in Python and registered using the udf function from 
pyspark.sql.functions) are run behind the scenes. Trying to grok the source, it
seems that the native Python function is serialized for distribution to the
cluster. In practice, it seems to be able to check for other variables and
functions defined elsewhere in the namespace and include those in the
function’s serialization.

Following all this though, when actually run, are Python interpreter instances 
on each node brought up to actually run the function against the RDDs, or can 
the serialized function somehow be run on just the JVM? If bringing up Python 
instances is the execution model, what is the overhead of PySpark UDFs like
compared to those registered in Scala?

Thanks,
Alek



Re: User Defined Functions - Execution on Clusters

2015-07-07 Thread Eskilson,Aleksander
Interesting, thanks for the heads up.

On 7/6/15, 7:19 PM, "Davies Liu"  wrote:

>Currently, Python UDFs run in Python instances and are MUCH slower than
>Scala ones (from 10 to 100x). There is a JIRA to improve the
>performance: https://issues.apache.org/jira/browse/SPARK-8632. Even after
>that, they will still be much slower than Scala ones (because Python
>is slower and there is overhead in calling into Python).

