Many thanks

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 26 April 2016 at 12:23, Praveen Devarao <praveen...@in.ibm.com> wrote:

> Given that you are not specifying any key explicitly [usually people
> use the Producer API and insert a key/value pair], the key will be
> null, so your tuples would look like below:
>
> (null, "ID       TIMESTAMP                           PRICE")
> (null, "40,20160426-080924,                  67.55738301621814598514")
>
> For values, the positions are 0-indexed. Hence (referring to your
> invocations) words1 will return the value for TIMESTAMP and words2 will
> return the value for PRICE, while words3 (view(3)) would be out of
> range on a three-column row.
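>
> For instance, a minimal sketch of the 0-indexed extraction (assuming
> the three-column layout above, with the padding whitespace trimmed):
>
>         val lines = dstream.map(_._2)   // value portion of each message
>
>         // 0-indexed columns: 0 = ID, 1 = TIMESTAMP, 2 = PRICE
>         val timestamps = lines.map(_.split(',')(1).trim)
>         val prices     = lines.map(_.split(',')(2).trim)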
>
> >> I assume this returns an array that can be handled element by
> element as well? <<
>
> These are all still inside your DStream; you will need to invoke an
> action on the DStream to use them, for instance words.foreachRDD(...).
>
> It should be easy for you to just run the streaming program and call print
> on each resulting DStream to understand what data is contained in it and
> decide how to make use of it.
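>
> For example, something along these lines (a sketch, assuming the
> prices stream from the snippet above):
>
>         prices.print()   // prints the first few records of every batch
>
>         // or act on each micro-batch RDD explicitly
>         prices.foreachRDD { rdd =>
>           rdd.take(5).foreach(println)
>         }
>
>         ssc.start()
>         ssc.awaitTermination()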
>
> Thanking You
>
> ---------------------------------------------------------------------------------
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> ---------------------------------------------------------------------------------
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From:        Mich Talebzadeh <mich.talebza...@gmail.com>
> To:        Praveen Devarao/India/IBM@IBMIN, "user @spark" <
> user@spark.apache.org>
> Date:        26/04/2016 04:03 pm
> Subject:        Re: Splitting spark dstream into separate fields
>
> ------------------------------
>
>
>
> Thanks Praveen.
>
> With regard to the key/value pair: my Kafka takes the following rows as input
>
> cat ${IN_FILE} | ${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list
> rhes564:9092 --topic newtopic
>
> That ${IN_FILE} is the source of prices (1,000 lines per second), as follows:
>
> ID       TIMESTAMP                           PRICE
> 40,     20160426-080924,                  67.55738301621814598514
>
> So the tuples would look like below?
>
> (1, "ID")
> (2, "TIMESTAMP")
> (3, "PRICE")
>
> For values
> val words1 = lines.map(_.split(',').view(1))
> val words2 = lines.map(_.split(',').view(2))
> val words3 = lines.map(_.split(',').view(3))
>
> So words1 will return the value of ID, words2 the value of TIMESTAMP
> and words3 the value of PRICE?
>
> I assume this returns an array that can be handled element by element
> as well?
>
> Regards
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 26 April 2016 at 11:11, Praveen Devarao <praveen...@in.ibm.com> wrote:
> Hi Mich,
>
>         >>
>                 val lines = dstream.map(_._2)
>
>                 Does this map the record into components? Is that the
> correct understanding of it?
>         <<
>
>         Not sure what you refer to when you say record into components.
> The direct stream hands you the tuples (key/value pairs) that you would
> have inserted into Kafka, and map(_._2) picks out the value portion.
> Say my Kafka producer puts data as
>
>         1 => "abc"
>         2 => "def"
>
>         Then the stream contains the tuples below, and the map gives
> you back just the values "abc" and "def":
>
>         (1, "abc")
>         (2, "def")
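>
>         In code terms, a minimal sketch (the type annotation is only
> for clarity and needs the DStream import):
>
>         import org.apache.spark.streaming.dstream.DStream
>
>         // dstream: DStream[(String, String)] -- (key, value) pairs from Kafka
>         val lines: DStream[String] = dstream.map(_._2)   // keep the value only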
>
>         >>
>                 The following splits the line into comma separated fields.
>
>                 val words = lines.map(_.split(',').view(2))
>         <<
>         Right, basically the value portion of your kafka data is being
> handled here
>
>         >>
>                 val words = lines.map(_.split(',').view(2))
>
>                 I am interested in column three, so view(2) returns
> the value.
>
>                 I have also seen other ways like
>
>                 val words = lines.map(_.split(',').map(line => (line(0),
> (line(1),line(2) ...
>         <<
>
>         The split operation returns an Array[String]. Calling the view
> method creates an IndexedSeqView over it, which evaluates lazily,
> whereas in the second way you access the elements directly via their
> index positions [line(0), line(1)]. You would have to decide what is
> best for your use case based on whether evaluation should be lazy or
> immediate [see references below].
>
>         References:
> http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike
> http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView
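>
>         For illustration only, a small sketch of the lazy/eager
> difference (the sample row here is made up):
>
>         val arr = "40,20160426-080924,67.55".split(',')   // Array[String]
>
>         // eager: map materialises a new array immediately
>         val eager = arr.map(_.trim)
>
>         // lazy: a view records the transformation and applies it on access
>         val lazyView = arr.view.map(_.trim)
>         val third = lazyView(2)   // _.trim runs here, for this element only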
>
>
> Thanking You
>
> ---------------------------------------------------------------------------------
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> ---------------------------------------------------------------------------------
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From:        Mich Talebzadeh <mich.talebza...@gmail.com>
> To:        "user @spark" <user@spark.apache.org>
> Date:        26/04/2016 12:58 pm
> Subject:        Splitting spark dstream into separate fields
> ------------------------------
>
>
>
> Hi,
>
> Is there an optimal way of splitting a dstream into its component fields?
>
> I am doing Spark streaming and this the dstream I get
>
> val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder,
> StringDecoder](ssc, kafkaParams, topics)
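>
> For completeness, the surrounding setup is along these lines (broker
> and topic taken from my producer command; the imports are for the
> Kafka 0.8 direct stream API):
>
> import kafka.serializer.StringDecoder
> import org.apache.spark.streaming.kafka.KafkaUtils
>
> val kafkaParams = Map("metadata.broker.list" -> "rhes564:9092")
> val topics = Set("newtopic")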
>
> Now that dstream consists of 1,000 price lines per second, like below:
>
> ID, TIMESTAMP, PRICE
> 31,20160426-080924,93.53608929178084896656
>
> The columns are separated by commas.
>
> Now a couple of questions:
>
> val lines = dstream.map(_._2)
>
> Does this map the record into components? Is that the correct
> understanding of it?
>
> The following splits the line into comma separated fields.
>
> val words = lines.map(_.split(',').view(2))
>
> I am interested in column three, so view(2) returns the value.
>
> I have also seen other ways like
>
> val words = lines.map(_.split(',').map(line => (line(0), (line(1),line(2)
> ...
>
> Do line(0), line(1) refer to the positions of the fields?
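>
> For context, the completed form I am experimenting with would be
> something like this (untested sketch):
>
> val words = lines.map(_.split(','))
>                  .map(cols => (cols(0).trim, (cols(1).trim, cols(2).trim)))
> // DStream[(String, (String, String))]: ID -> (TIMESTAMP, PRICE)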
>
> Which one is the preferred or correct approach?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
>
>
