Yeah, I noticed all columns are cast to strings. Thanks, Alek, for pointing out the solution before I even encountered the problem.
2015-06-26 7:01 GMT-07:00 Eskilson,Aleksander <alek.eskil...@cerner.com>:

> Yeah, I ask because you might notice that by default the column types
> for CSV tables read in by read.df() are only strings (due to limitations
> in type inferencing in the Databricks package). There was a separate
> discussion about schema inferencing, and Shivaram recently merged support
> for specifying your own schema as an argument to read.df(). The schema is
> defined as a structType. To see how this schema is declared, check out
> Hossein Falaki's response in this thread [1].
>
> -- Alek
>
> [1] -- http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-DataFrame-Column-Casts-esp-from-CSV-Files-td12589.html
>
> From: Wei Zhou <zhweisop...@gmail.com>
> Date: Thursday, June 25, 2015 at 4:38 PM
> To: Aleksander Eskilson <alek.eskil...@cerner.com>
> Cc: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>, "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: sparkR could not find function "textFile"
>
> I tried out the solution using the spark-csv package, and it works fine
> now :) Thanks. Yes, I'm playing with a file with all columns as String,
> but the real data I want to process are all doubles. I'm just exploring
> what sparkR can do versus regular Scala Spark, as I am at heart an R
> person.
>
> 2015-06-25 14:26 GMT-07:00 Eskilson,Aleksander <alek.eskil...@cerner.com>:
>
>> Sure, I had a similar question that Shivaram was able to answer quickly
>> for me; the solution is implemented using a separate Databricks library.
>> Check out this thread from the email archives [1], and the read.df()
>> command [2]. CSV files can be a bit tricky, especially with inferring
>> their schemas. Are you using just strings as your column types right now?
>>
>> Alek
>>
>> [1] -- http://apache-spark-developers-list.1001551.n3.nabble.com/CSV-Support-in-SparkR-td12559.html
>> [2] -- https://spark.apache.org/docs/latest/api/R/read.df.html
>>
>> From: Wei Zhou <zhweisop...@gmail.com>
>> Date: Thursday, June 25, 2015 at 4:15 PM
>> To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
>> Cc: Aleksander Eskilson <alek.eskil...@cerner.com>, "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: sparkR could not find function "textFile"
>>
>> Thanks to both Shivaram and Alek. If I want to create a DataFrame from
>> comma-separated flat files, what would you recommend I do? One way I can
>> think of is first reading the data as you would in R, using read.table(),
>> and then creating a Spark DataFrame out of that R data.frame, but that is
>> obviously not scalable.
>>
>> 2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman <shiva...@eecs.berkeley.edu>:
>>
>>> The `head` function is not supported for the RRDD that is returned by
>>> `textFile`. You can run `take(lines, 5L)` instead. I should add a
>>> warning here that the RDD API in SparkR is private because we might not
>>> support it in upcoming releases. So if you can use the DataFrame API for
>>> your application, you should try that out.
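Alek's schema suggestion can be sketched roughly as follows for a CSV file (a sketch against the SparkR 1.4 API, run inside a SparkR shell where sc and sqlContext already exist and the spark-csv package has been loaded via --packages; the file name "cars.csv" and its column names are hypothetical):

```r
# Declare the column types up front instead of accepting all-string inference.
customSchema <- structType(
  structField("model", "string"),
  structField("mpg", "double"),
  structField("cyl", "double")
)

df <- read.df(sqlContext, "cars.csv",
              source = "com.databricks.spark.csv",
              schema = customSchema,
              header = "true")
printSchema(df)   # mpg and cyl now come back as double rather than string
```

For small exploratory files, the read.table() route Wei describes also works: read into a local data.frame and call createDataFrame(sqlContext, localDF), with the caveat that the whole file must fit in the driver's memory.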
>>>
>>> Thanks,
>>> Shivaram
>>>
>>> On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou <zhweisop...@gmail.com> wrote:
>>>
>>>> Hi Alek,
>>>>
>>>> Just a follow-up question. This is what I did in the sparkR shell:
>>>>
>>>> lines <- SparkR:::textFile(sc, "./README.md")
>>>> head(lines)
>>>>
>>>> And I am getting the error:
>>>>
>>>> "Error in x[seq_len(n)] : object of type 'S4' is not subsettable"
>>>>
>>>> I'm wondering what I did wrong. Thanks in advance.
>>>>
>>>> Wei
>>>>
>>>> 2015-06-25 13:44 GMT-07:00 Wei Zhou <zhweisop...@gmail.com>:
>>>>
>>>>> Hi Alek,
>>>>>
>>>>> Thanks for the explanation, it is very helpful.
>>>>>
>>>>> Cheers,
>>>>> Wei
>>>>>
>>>>> 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander <alek.eskil...@cerner.com>:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> The tutorial you're reading was written before the merge of SparkR
>>>>>> for Spark 1.4.0. In the merge, the RDD API (which includes the
>>>>>> textFile() function) was made private, as the devs felt many of its
>>>>>> functions were too low-level. They focused instead on finishing the
>>>>>> DataFrame API, which supports local, HDFS, and Hive/HBase file reads.
>>>>>> In the meantime, the devs are trying to determine which functions of
>>>>>> the RDD API, if any, should be made public again. You can see the
>>>>>> rationale behind this decision on the issue's JIRA [1].
>>>>>>
>>>>>> You can still make use of those now-private RDD functions by
>>>>>> prepending the function call with the SparkR private namespace; for
>>>>>> example, you'd use SparkR:::textFile(…).
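Putting Shivaram's and Alek's advice together, Wei's failing snippet becomes (a sketch against the private RDD API in SparkR 1.4, so it may change or break in later releases):

```r
# SparkR:::textFile returns an S4 RRDD object, which base R's head() cannot
# subset -- hence the "object of type 'S4' is not subsettable" error.
lines <- SparkR:::textFile(sc, "./README.md")
take(lines, 5L)   # returns the first five lines as a list
```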
>>>>>>
>>>>>> Hope that helps,
>>>>>> Alek
>>>>>>
>>>>>> [1] -- https://issues.apache.org/jira/browse/SPARK-7230
>>>>>>
>>>>>> From: Wei Zhou <zhweisop...@gmail.com>
>>>>>> Date: Thursday, June 25, 2015 at 3:33 PM
>>>>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>>>>> Subject: sparkR could not find function "textFile"
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am exploring sparkR by activating the shell and following the
>>>>>> tutorial here: https://amplab-extras.github.io/SparkR-pkg/
>>>>>>
>>>>>> When I tried to read in a local file with textFile(sc,
>>>>>> "file_location"), it gave the error: could not find function
>>>>>> "textFile".
>>>>>>
>>>>>> Reading through the sparkR doc for 1.4, it seems that we need a
>>>>>> sqlContext to import data, for example:
>>>>>>
>>>>>> people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
>>>>>>
>>>>>> And we need to specify the file type.
>>>>>>
>>>>>> My question is: has sparkR stopped supporting general file importing?
>>>>>> If not, I would appreciate any help on how to do this.
>>>>>>
>>>>>> PS: I am trying to recreate the word count example in sparkR, and want
>>>>>> to import the README.md file, or just any file, into sparkR.
>>>>>>
>>>>>> Thanks in advance.
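Since Wei's end goal is the word count example, the old SparkR-pkg tutorial version can still be reproduced through the private namespace (a sketch; the function names follow the pre-merge SparkR-pkg API and are assumed to be unchanged behind SparkR:::, which is not guaranteed):

```r
# All RDD functions are private in Spark 1.4, hence the ::: prefix throughout.
lines  <- SparkR:::textFile(sc, "./README.md")
words  <- SparkR:::flatMap(lines, function(line) strsplit(line, " ")[[1]])
pairs  <- SparkR:::lapply(words, function(word) list(word, 1L))
counts <- SparkR:::reduceByKey(pairs, "+", 2L)
output <- SparkR:::collect(counts)
```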
>>>>>>
>>>>>> Best,
>>>>>> Wei