Re: SparkR - calling as.vector() with rdd dataframe causes error
localDF is a pure R data.frame, so as.vector() will work on it with no problems. As for the SparkR objects, try calling collect() before you call as.vector() (or, in your case, before you call the clustering algorithms); that should solve your problem.

On Mon, Sep 21, 2015 at 8:48 AM, Ellen Kraffmiller <ellen.kraffmil...@gmail.com> wrote:

> Thank you for the link! I was using
> http://apache-spark-user-list.1001560.n3.nabble.com/, and I didn't see
> replies there.
>
> Regarding your code example, I'm doing the same thing and successfully
> creating the rdd, but the problem is that when I call a clustering
> algorithm like amap::hcluster(), I get an error from as.vector() saying the
> rdd cannot be coerced into a vector.
--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
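[Editorial note: the following is a minimal sketch of the collect()-before-clustering suggestion above, not part of the original thread. It assumes a SparkR 1.5 shell where sqlContext already exists; the matrix.csv path, the galileo() wrapper, and the amap::hcluster() arguments are taken from the thread and are illustrative rather than verified.]

# Build the distributed DataFrame as in the original post, then bring it
# back to the driver with collect() before calling local clustering code.
raw_data <- read.csv("matrix.csv", sep = ",", header = FALSE)   # path is an assumption
localDF  <- data.frame(raw_data)
rdd      <- createDataFrame(sqlContext, localDF)   # distributed SparkR DataFrame

localAgain <- collect(rdd)   # plain R data.frame again, usable by base R / amap

# The user's own galileo() wrapper should now behave as it does with localDF:
# result <- galileo(localAgain, model='hclust', dist='euclidean', link='ward', K=5)

# Or directly with amap::hcluster(); the argument names here are illustrative.
hc <- amap::hcluster(localAgain, method = "euclidean", link = "ward")
print(hc)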
Re: SparkR - calling as.vector() with rdd dataframe causes error
Thank you for the link! I was using
http://apache-spark-user-list.1001560.n3.nabble.com/, and I didn't see
replies there.

Regarding your code example, I'm doing the same thing and successfully
creating the rdd, but the problem is that when I call a clustering algorithm
like amap::hcluster(), I get an error from as.vector() saying the rdd cannot
be coerced into a vector.

On Fri, Sep 18, 2015 at 12:33 PM, Luciano Resende wrote:

> I see the thread with all the responses at the bottom of mail-archive:
>
> https://www.mail-archive.com/user%40spark.apache.org/msg36882.html
Re: SparkR - calling as.vector() with rdd dataframe causes error
Thanks for your response. Is there a reason why this thread isn't appearing
on the mailing list? So far, I only see my post, with no answers, although I
have received 2 answers via email. It would be nice if other people could
see these answers as well.

On Thu, Sep 17, 2015 at 2:22 AM, Sun, Rui wrote:

> The existing algorithms operating on R data.frame can't simply operate on
> a SparkR DataFrame. They have to be re-implemented on top of the SparkR
> DataFrame API.
Re: SparkR - calling as.vector() with rdd dataframe causes error
I see the thread with all the responses at the bottom of mail-archive:

https://www.mail-archive.com/user%40spark.apache.org/msg36882.html

On Fri, Sep 18, 2015 at 7:58 AM, Ellen Kraffmiller <ellen.kraffmil...@gmail.com> wrote:

> Thanks for your response. Is there a reason why this thread isn't
> appearing on the mailing list? So far, I only see my post, with no
> answers, although I have received 2 answers via email. It would be nice if
> other people could see these answers as well.

--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
RE: SparkR - calling as.vector() with rdd dataframe causes error
The existing algorithms operating on R data.frame can't simply operate on a
SparkR DataFrame. They have to be re-implemented on top of the SparkR
DataFrame API.

-----Original Message-----
From: ekraffmiller [mailto:ellen.kraffmil...@gmail.com]
Sent: Thursday, September 17, 2015 3:30 AM
To: user@spark.apache.org
Subject: SparkR - calling as.vector() with rdd dataframe causes error

Hi,
I have a library of clustering algorithms that I'm trying to run in the
SparkR interactive shell. (I am working on a proof of concept for a document
classification tool.) Each algorithm takes a term-document matrix in the
form of a dataframe. When I pass a method a local dataframe, the clustering
algorithm works correctly, but when I pass it a Spark rdd, it gives an error
while trying to coerce the data into a vector. Here is the code that I'm
calling within SparkR:

# get matrix from a file
file <- "/Applications/spark-1.5.0-bin-hadoop2.6/examples/src/main/resources/matrix.csv"

# read it into a variable
raw_data <- read.csv(file, sep = ',', header = FALSE)

# convert to a local dataframe
localDF <- data.frame(raw_data)

# create the rdd
rdd <- createDataFrame(sqlContext, localDF)

# call the algorithm with the localDF - this works
result <- galileo(localDF, model='hclust', dist='euclidean', link='ward', K=5)

# call with the rdd - this produces an error
result <- galileo(rdd, model='hclust', dist='euclidean', link='ward', K=5)

Error in as.vector(data) :
  no method for coercing this S4 class to a vector

I get the same error if I try to call as.vector(rdd) directly as well.

Is there a reason why this works for localDF and not rdd? Should I be doing
something else to coerce the object into a vector?

Thanks,
Ellen

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-calling-as-vector-with-rdd-dataframe-causes-error-tp24717.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
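[Editorial note: the following sketch is added for illustration and is not from the thread. It contrasts what works on a local data.frame with what the SparkR 1.5 DataFrame API provides, assuming a SparkR shell where sqlContext is already defined.]

# Local R data.frame vs. distributed SparkR DataFrame
localDF <- data.frame(V1 = c(1, 4, 7), V2 = c(2, 5, 8), V3 = c(3, 6, 9))
df <- createDataFrame(sqlContext, localDF)

as.vector(localDF$V1)   # fine: base R coercion on a local column
# as.vector(df)         # fails: "no method for coercing this S4 class to a vector"

# Operations on df must go through the SparkR DataFrame API instead:
count(df)               # row count computed by Spark
head(select(df, "V1"))  # column projection
showDF(describe(df))    # summary statistics

# To reuse an existing data.frame-based algorithm, collect the data first:
local2 <- collect(df)
dist(local2)            # base R works again on the collected data.frame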
Re: SparkR - calling as.vector() with rdd dataframe causes error
You can find some more info about SparkR at
https://spark.apache.org/docs/latest/sparkr.html

Looking at your sample app, with the provided content, you should be able to
run it on SparkR with something like:

# load SparkR with support for csv
sparkR --packages com.databricks:spark-csv_2.10:1.0.3

sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)

# get matrix from a file
file <- "file:///./matrix.csv"

# read it into a variable
raw_data <- read.csv(file, sep=',', header=FALSE)

# convert to a local dataframe
localDF <- data.frame(raw_data)

# create the rdd
rdd <- createDataFrame(sqlContext, localDF)

printSchema(rdd)
head(rdd)

I was also trying to read the csv directly with spark-csv:

df <- read.df(sqlContext, file, "com.databricks.spark.csv", header="false", sep=",")

That worked, but then I was getting exceptions when I tried:

printSchema(df)
head(df)

15/09/17 18:33:30 ERROR CsvRelation$: Exception while parsing line: 7,8,9.
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
	at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
	at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:49)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:247)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:82)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:61)
	at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:150)
	at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:130)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1843)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1843)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

I will investigate this further and create a jira if necessary.

On Wed, Sep 16, 2015 at 11:22 PM, Sun, Rui wrote:

> The existing algorithms operating on R data.frame can't simply operate on
> a SparkR DataFrame. They have to be re-implemented on top of the SparkR
> DataFrame API.
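[Editorial note: a possible follow-up, added here as an assumption rather than something tried in the thread: pass an explicit schema to read.df() so spark-csv does not leave every column as a string. structType() and structField() are part of the SparkR 1.5 API; whether this actually avoids the ClassCastException above is not verified.]

# Hypothetical: declare the three integer columns of matrix.csv up front.
schema <- structType(structField("V1", "integer"),
                     structField("V2", "integer"),
                     structField("V3", "integer"))

df <- read.df(sqlContext, file, source = "com.databricks.spark.csv",
              schema = schema, header = "false", delimiter = ",")

printSchema(df)   # should now report three integer columns
head(df)          # may still hit the same spark-csv / Spark 1.5 issue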
Re: SparkR - calling as.vector() with rdd dataframe causes error
Also, just for completeness, matrix.csv contains:

1,2,3
4,5,6
7,8,9