Hi All, I am trying to do a word count on a set of tweets. My first step is to get the data from a table using Spark SQL, and then run a split function on top of it to calculate the word count.
Error:

    value split is not a member of org.apache.spark.sql.SchemaRDD

Spark code that doesn't work for the word count:

    val distinct_tweets = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
    val distinct_tweets_List = sc.parallelize(List(distinct_tweets))
    // tried split on both RDDs; neither worked
    distinct_tweets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    distinct_tweets_List.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

But when I write the Spark SQL output to a file, load it again, and then run split, it works.

Example code that works:

    val distinct_tweets = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
    val distinct_tweets_op = distinct_tweets.collect()
    val rdd = sc.parallelize(distinct_tweets_op)
    rdd.saveAsTextFile("/home/cloudera/bdp/op")
    val textFile = sc.textFile("/home/cloudera/bdp/op/part-00000")
    val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("/home/cloudera/bdp/wordcount")

I don't want to write to a file; instead I want to collect into an RDD and apply a filter function on top of the schema RDD. Is there a way?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/split-function-on-spark-sql-created-rdd-tp23001.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
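The likely cause of the error above is that in Spark 1.x a SchemaRDD returned by `hiveCtx.sql` is an RDD of `Row` objects, not of `String`, so `split` is not available until the string column is extracted from each row (e.g. with `row.getString(0)`). A minimal sketch, with the word-count core expressed in plain Scala collections so it runs without a cluster, and the assumed SchemaRDD fix shown in comments; the table and variable names follow the original post:

```scala
object WordCountSketch {
  // Core word-count logic on plain Scala collections; the RDD version
  // uses the same combinators (flatMap / map / reduceByKey).
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // RDD: flatMap(line => line.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))  // RDD: map(word => (word, 1))
      .groupBy(_._1)           // RDD: reduceByKey(_ + _)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  // Assumed fix on the SchemaRDD itself (Spark 1.x API, untested here):
  //   val distinct_tweets = hiveCtx.sql(
  //     "select distinct(text) from tweets_table where text <> ''")
  //   val counts = distinct_tweets
  //     .map(row => row.getString(0))   // Row -> String: the missing step
  //     .flatMap(_.split(" "))
  //     .map(word => (word, 1))
  //     .reduceByKey(_ + _)
  // counts stays an RDD, so filter can be chained without writing to a file.
}
```

The same `map(row => row.getString(0))` step is what the working example gets implicitly: `saveAsTextFile` stringifies each row, so reading the file back yields an RDD[String] on which `split` compiles.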