Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-22 Thread Luciano Resende
localDF is a pure R data frame, so as.vector() will work on it with no
problems. As for calling it on the SparkR objects: call collect() before you
call as.vector() (or, in your case, before you call the algorithms); that
should solve your problem.
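
For example, a minimal sketch using the names from the original post (untested):

# collect() materializes the distributed SparkR DataFrame on the driver as a
# plain R data.frame, which base R functions such as as.vector() can handle
localAgain <- collect(rdd)
result <- galileo(localAgain, model='hclust', dist='euclidean', link='ward', K=5)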



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-21 Thread Ellen Kraffmiller
Thank you for the link! I was using
http://apache-spark-user-list.1001560.n3.nabble.com/, and I didn't see
replies there.

Regarding your code example, I'm doing the same thing and successfully
creating the rdd, but the problem is that when I call a clustering
algorithm like amap::hcluster(), I get an error from as.vector() that the
rdd cannot be coerced into a vector.
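
For concreteness, a minimal sketch of the working vs. failing call, using the
names from the original post (quoted in full further down this thread; untested):

library(amap)
# a plain R data.frame can be coerced, so this runs
hcluster(localDF, method='euclidean', link='ward')
# a SparkR DataFrame is an S4 object with no as.vector() method, so this errors
hcluster(rdd, method='euclidean', link='ward')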



Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-18 Thread Ellen Kraffmiller
Thanks for your response.  Is there a reason why this thread isn't
appearing on the mailing list?  So far, I only see my post, with no
answers, although I have received 2 answers via email.  It would be nice if
other people could see these answers as well.



Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-18 Thread Luciano Resende
I can see the thread, with all the responses at the bottom, at mail-archive:

https://www.mail-archive.com/user%40spark.apache.org/msg36882.html



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


RE: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-17 Thread Sun, Rui
Existing algorithms that operate on an R data.frame can't simply operate on a
SparkR DataFrame; they would have to be re-implemented on top of the SparkR
DataFrame API.
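
A quick way to see the mismatch, using the names from the quoted message below
(illustrative sketch; the class name is as of SparkR 1.5):

class(localDF)   # "data.frame" - a plain R object
class(rdd)       # "DataFrame"  - an S4 class defined by the SparkR package
isS4(rdd)        # TRUE - base R coercions such as as.vector() have no method for it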

-Original Message-
From: ekraffmiller [mailto:ellen.kraffmil...@gmail.com] 
Sent: Thursday, September 17, 2015 3:30 AM
To: user@spark.apache.org
Subject: SparkR - calling as.vector() with rdd dataframe causes error

Hi,
I have a library of clustering algorithms that I'm trying to run in the SparkR
interactive shell. (I am working on a proof of concept for a document
classification tool.) Each algorithm takes a term-document matrix in the form
of a data frame. When I pass the method a local data frame, the clustering
algorithm works correctly, but when I pass it the SparkR DataFrame (the
variable named rdd below), it gives an error trying to coerce the data into a
vector. Here is the code that I'm calling within SparkR:

# get matrix from a file
file <-
"/Applications/spark-1.5.0-bin-hadoop2.6/examples/src/main/resources/matrix.csv"

# read it into a variable
raw_data <- read.csv(file, sep = ',', header = FALSE)

# convert to a local data frame
localDF <- data.frame(raw_data)

# create the Spark DataFrame
rdd <- createDataFrame(sqlContext, localDF)

# call the algorithm with localDF - this works
result <- galileo(localDF, model='hclust', dist='euclidean', link='ward', K=5)

# call with the Spark DataFrame - this produces the error
result <- galileo(rdd, model='hclust', dist='euclidean', link='ward', K=5)

Error in as.vector(data) : 
  no method for coercing this S4 class to a vector


I get the same error if I try to directly call as.vector(rdd) as well.

Is there a reason why this works for localDF and not rdd?  Should I be doing 
something else to coerce the object into a vector?

Thanks,
Ellen






Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-17 Thread Luciano Resende
You can find some more info about SparkR at
https://spark.apache.org/docs/latest/sparkr.html

Looking at your sample app, with the provided content, you should be able
to run it on SparkR with something like:

# load SparkR with support for csv (Spark 1.5 prebuilt binaries use Scala 2.10,
# so the _2.10 artifact is used consistently below)
sparkR --packages com.databricks:spark-csv_2.10:1.0.3

sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.10:1.0.3")
sqlContext <- sparkRSQL.init(sc)

# get matrix from a file
file <- "file:///./matrix.csv"

# read it into a variable
raw_data <- read.csv(file, sep = ',', header = FALSE)

# convert to a local data frame
localDF <- data.frame(raw_data)

# create the Spark DataFrame
rdd <- createDataFrame(sqlContext, localDF)

printSchema(rdd)
head(rdd)

I also tried reading the CSV directly in SparkR:
df <- read.df(sqlContext, file, "com.databricks.spark.csv",
              header = "false", sep = ",")

That worked, but then I got exceptions when I tried:
printSchema(df)
head(df)

15/09/17 18:33:30 ERROR CsvRelation$: Exception while parsing line: 7,8,9.
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
    at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
    at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:49)
    at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:247)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:82)
    at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:61)
    at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:150)
    at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:130)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1843)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1843)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I will investigate this further and file a JIRA if necessary.
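
In the meantime, a hedged workaround sketch (unverified): spark-csv 1.0.3
predates Spark 1.5, so a newer spark-csv build may avoid the UTF8String cast,
and the inferSchema option should give typed columns instead of all strings:

sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.10:1.2.0")
sqlContext <- sparkRSQL.init(sc)
df <- read.df(sqlContext, file, "com.databricks.spark.csv",
              header = "false", inferSchema = "true")
printSchema(df)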


Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-16 Thread ekraffmiller
Also, just for completeness, matrix.csv contains:
1,2,3
4,5,6
7,8,9
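
Read with read.csv(file, header = FALSE) as in the original post, this should
yield a 3x3 data frame with R's default column names:

  V1 V2 V3
1  1  2  3
2  4  5  6
3  7  8  9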


