Hi Mich,

        >> 
                val lines = dstream.map(_._2)

                This maps the record into components? Is that the correct
                understanding of it?
        <<

        Not sure what you mean by "record into components". The direct
        stream carries the (key, value) tuples that you would have inserted
        into Kafka, and the map above (_._2) selects the value of each
        tuple. Say my Kafka producer puts data as

        1 => "abc"
        2 => "def"

        Then the dstream carries the tuples below, and the map gives you
        just the values "abc" and "def"

        (1,"abc")
        (2,"def")
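        To make this concrete, below is a minimal sketch (plain Scala, no
        Spark context needed; the records are the hypothetical ones above)
        of what _._2 does to those pairs:

        // Stand-in for the (key, value) tuples the direct stream carries
        val records = Seq((1, "abc"), (2, "def"))

        // _._2 selects the second element (the value) of each pair
        val values = records.map(_._2)

        println(values)   // List(abc, def)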

        >> 
                The following splits the line into its comma-separated fields.


                val words = lines.map(_.split(',').view(2))
        <<
        Right, basically the value portion of your Kafka data is being
        handled here.

        >>
                val words = lines.map(_.split(',').view(2))

                I am interested in column three, so view(2) returns the
                value.
 
                I have also seen other ways like

                val words = lines.map(_.split(',').map(line => (line(0), 
(line(1),line(2) ...
        <<

        The split operation returns an Array[String] (a mutable, indexed
        collection). Calling the view method creates an IndexedSeqView over
        that array, and view(2) then fetches the element at index 2 through
        the view, whereas in the second way you access the elements directly
        by index position [line(0), line(1)]. You would have to decide what
        is best for your use case based on whether evaluation should be lazy
        or immediate [see references below].
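        To see the difference concretely, below is a minimal sketch (plain
        Scala, outside of Spark, using the sample record from your message)
        of both ways of pulling out the third field. Note that in the Scala
        2.10 collections, arr.view(2) resolves to arr.view.apply(2), i.e. a
        view is built over the array and then indexed:

        val line = "31,20160426-080924,93.53608929178084896656"

        // Way 1: go through a view; view(2) desugars to view.apply(2)
        // and fetches the element at index 2 (the PRICE field)
        val price: String = line.split(',').view(2)

        // Way 2: split once and index the resulting Array[String] directly
        val fields = line.split(',')
        val record = (fields(0), (fields(1), fields(2)))  // (ID, (TIMESTAMP, PRICE))

        println(price)    // 93.53608929178084896656
        println(record)   // (31,(20160426-080924,93.53608929178084896656))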

        References:
        http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike
        http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView

 
 
Thanking You
---------------------------------------------------------------------------------
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
---------------------------------------------------------------------------------
"Courage doesn't always roar. Sometimes courage is the quiet voice at the 
end of the day saying I will try again"



From:   Mich Talebzadeh <mich.talebza...@gmail.com>
To:     "user @spark" <user@spark.apache.org>
Date:   26/04/2016 12:58 pm
Subject:        Splitting spark dstream into separate fields



Hi,

Is there an optimal way of splitting a dstream into components?

I am doing Spark Streaming and this is the dstream I get:

val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, 
StringDecoder](ssc, kafkaParams, topics)

Now that dstream consists of 10,00 price lines per second like below

ID, TIMESTAMP, PRICE
31,20160426-080924,93.53608929178084896656

The columns are separated by commas.

Now a couple of questions:

val lines = dstream.map(_._2)

This maps the record into components? Is that the correct understanding of
it?

The following splits the line into its comma-separated fields.

val words = lines.map(_.split(',').view(2))

I am interested in column three, so view(2) returns the value.

I have also seen other ways like

val words = lines.map(_.split(',').map(line => (line(0), (line(1),line(2) 
...

line(0), line(1) refer to the positions of the fields?

Which one is the preferred or correct approach?

Thanks


Dr Mich Talebzadeh
 
LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 
http://talebzadehmich.wordpress.com
 


