BTW, flatmap is misspelled; it should be flatMap. See RDD.scala:

  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
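Putting Ted's point and the spelling fix together, a minimal sketch (mine, not
from the thread) of the word count run directly on the query result, assuming
Spark 1.3+ where hiveCtx.sql returns a DataFrame, and reusing the table and
paths from the original post:

  import org.apache.spark.sql.hive.HiveContext

  val hiveCtx = new HiveContext(sc)  // sc is the shell's existing SparkContext

  val distinct_tweets = hiveCtx.sql(
    "select distinct(text) from tweets_table where text <> ''")

  // Drop to RDD[Row], pick the single column by field index (Ted's suggestion),
  // then split -- note flatMap with a capital M.
  val counts = distinct_tweets.rdd
    .flatMap(row => row.getString(0).split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  counts.saveAsTextFile("/home/cloudera/bdp/wordcount")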
On Sat, May 23, 2015 at 8:52 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> hiveCtx.sql() returns a DataFrame, which doesn't have a split method.
>
> The columns of a row in the result can be accessed by field index:
>
> line => line(0).split(" ")
>
> Cheers
>
> On Sat, May 23, 2015 at 5:16 AM, kali.tumm...@gmail.com <kali.tumm...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am trying to do a word count over tweets. My first step is to get the
>> data from a table using Spark SQL and then run the split function on it
>> to calculate the word count.
>>
>> Error: value split is not a member of org.apache.spark.sql.SchemaRDD
>>
>> Spark code that doesn't work for the word count:
>>
>> val distinct_tweets=hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
>> val distinct_tweets_List=sc.parallelize(List(distinct_tweets))
>> // tried split on both RDDs, didn't work
>> distinct_tweets.flatmap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
>> distinct_tweets_List.flatmap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
>>
>> But when I write the output of Spark SQL to a file, load it again, and
>> run split, it works.
>>
>> Example code that works:
>>
>> val distinct_tweets=hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
>> val distinct_tweets_op=distinct_tweets.collect()
>> val rdd=sc.parallelize(distinct_tweets_op)
>> rdd.saveAsTextFile("/home/cloudera/bdp/op")
>> val textFile=sc.textFile("/home/cloudera/bdp/op/part-00000")
>> val counts=textFile.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
>> counts.saveAsTextFile("/home/cloudera/bdp/wordcount")
>>
>> I don't want to write to a file; instead I want to collect into an RDD
>> and apply a filter function on top of the schema RDD. Is there a way?
>>
>> Thanks
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/split-function-on-spark-sql-created-rdd-tp23001.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
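On the closing question in the original post (applying a filter without the
file round trip): a rough sketch under the same assumption (Spark 1.3+
DataFrame); the "spark" predicate and the "text" column name are only
illustrative, not from the thread.

  // Filter the raw rows after dropping to RDD[Row] -- no saveAsTextFile needed.
  val tweets_with_spark = distinct_tweets.rdd
    .map(row => row.getString(0))
    .filter(text => text.contains("spark"))

  // Or filter at the DataFrame level first, assuming the result column is named "text".
  val filtered_df = distinct_tweets.filter(distinct_tweets("text").contains("spark"))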