Hi All, 

I am trying to do a word count on a set of tweets. My first step is to get the
data from a table using Spark SQL, and then run a split function on top of it
to calculate the word count.

Error: value split is not a member of org.apache.spark.sql.SchemaRDD

Spark code that doesn't work for the word count:

val distinct_tweets = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
val distinct_tweets_List = sc.parallelize(List(distinct_tweets))
// tried split on both of these RDDs; neither worked
distinct_tweets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
distinct_tweets_List.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
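
My guess is that the elements of the SchemaRDD are Row objects rather than plain
Strings, which would explain why split isn't available on them, but I haven't
been able to verify it. Something like this conversion may be what's missing
(getString(0) is just my guess at how to read the text column):

val tweet_lines = distinct_tweets.map(row => row.getString(0))  // Row -> String (unverified)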

But when I write the data from Spark SQL out to a file, load it back in, and
run split, it works.

Example code that works:

val distinct_tweets = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
val distinct_tweets_op = distinct_tweets.collect()
val rdd = sc.parallelize(distinct_tweets_op)
rdd.saveAsTextFile("/home/cloudera/bdp/op")
val textFile = sc.textFile("/home/cloudera/bdp/op/part-00000")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("/home/cloudera/bdp/wordcount")

I don't want to write to a file; instead, I want to collect the results in an
RDD and apply a filter function on top of the SchemaRDD. Is there a way to do
this?
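
Roughly, this is what I'm hoping to end up with, skipping the intermediate file
entirely (just a sketch, assuming the rows can be mapped to plain strings first
and that filter/flatMap then behave as they do on a normal RDD -- I haven't
gotten this to run):

// hypothetical end-to-end pipeline, all in memory
val counts = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
  .map(row => row.getString(0))        // Row -> String (my assumption)
  .filter(line => line.nonEmpty)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("/home/cloudera/bdp/wordcount")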

Thanks 






