BTW, flatmap is misspelled; it should be flatMap. See RDD.scala:

  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
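Putting Ted's point and the spelling fix together, a minimal sketch (mine, not
from the thread) of the word count run directly on the query result, assuming
Spark 1.3+ where hiveCtx.sql returns a DataFrame, and reusing the table and
paths from the original post:

  import org.apache.spark.sql.hive.HiveContext

  val hiveCtx = new HiveContext(sc)  // sc is the shell's existing SparkContext

  val distinct_tweets = hiveCtx.sql(
    "select distinct(text) from tweets_table where text <> ''")

  // Drop to RDD[Row], pick the single column by field index (Ted's suggestion),
  // then split -- note flatMap with a capital M.
  val counts = distinct_tweets.rdd
    .flatMap(row => row.getString(0).split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  counts.saveAsTextFile("/home/cloudera/bdp/wordcount")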
On Sat, May 23, 2015 at 8:52 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> hiveCtx.sql() returns a DataFrame, which doesn't have a split method.
>
> The columns of a row in the result can be accessed by field index:
>
> line => line(0).split(" ")
>
> Cheers
>
> On Sat, May 23, 2015 at 5:16 AM, kali.tumm...@gmail.com <kali.tumm...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am trying to do a word count over tweets. My first step is to get the
>> data from a table using Spark SQL and then run the split function on it
>> to calculate the word count.
>>
>> Error: value split is not a member of org.apache.spark.sql.SchemaRDD
>>
>> Spark code that doesn't work for the word count:
>>
>> val distinct_tweets=hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
>> val distinct_tweets_List=sc.parallelize(List(distinct_tweets))
>> // tried split on both RDDs, didn't work
>> distinct_tweets.flatmap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
>> distinct_tweets_List.flatmap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
>>
>> But when I write the output of Spark SQL to a file, load it again, and
>> run split, it works.
>>
>> Example code that works:
>>
>> val distinct_tweets=hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
>> val distinct_tweets_op=distinct_tweets.collect()
>> val rdd=sc.parallelize(distinct_tweets_op)
>> rdd.saveAsTextFile("/home/cloudera/bdp/op")
>> val textFile=sc.textFile("/home/cloudera/bdp/op/part-00000")
>> val counts=textFile.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
>> counts.saveAsTextFile("/home/cloudera/bdp/wordcount")
>>
>> I don't want to write to a file; instead I want to collect into an RDD
>> and apply a filter function on top of the schema RDD. Is there a way?
>>
>> Thanks
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/split-function-on-spark-sql-created-rdd-tp23001.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
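On the closing question in the original post (applying a filter without the
file round trip): a rough sketch under the same assumption (Spark 1.3+
DataFrame); the "spark" predicate and the "text" column name are only
illustrative, not from the thread.

  // Filter the raw rows after dropping to RDD[Row] -- no saveAsTextFile needed.
  val tweets_with_spark = distinct_tweets.rdd
    .map(row => row.getString(0))
    .filter(text => text.contains("spark"))

  // Or filter at the DataFrame level first, assuming the result column is named "text".
  val filtered_df = distinct_tweets.filter(distinct_tweets("text").contains("spark"))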