Hi,

Ad Q1, yes. See stateful operators like mapWithState and windows.
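
A rough, untested sketch of the plain-DStream route using reduceByKeyAndWindow over a 300-second window. The `Reading` case class, the `htStats` helper, and the upstream JSON parsing are my assumptions, not anything from your code; the same pattern extends to rotor_rpm and temp:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type matching the JSON schema below;
// parsing the raw Kafka values into it is left out.
case class Reading(id: Long, ht: Double, rotorRpm: Double, temp: Double)

// Carry (min, max, sum, count) together so a single windowed
// reduce yields min, max and average in one pass.
type Agg = (Double, Double, Double, Long)

def htStats(readings: DStream[Reading]): DStream[(Long, (Double, Double, Double))] =
  readings
    .map(r => (r.id, (r.ht, r.ht, r.ht, 1L): Agg))
    .reduceByKeyAndWindow(
      (a: Agg, b: Agg) =>
        (math.min(a._1, b._1), math.max(a._2, b._2), a._3 + b._3, a._4 + b._4),
      Seconds(300))
    .mapValues { case (min, max, sum, count) => (min, max, sum / count) }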

Ad Q2, RDDs should be fine (and available out of the box), but I'd give
Datasets a try too, since they're just a .toDF away.
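
For the DataFrame route, something along these lines (also an untested sketch; `jsonStream` is assumed to be the DStream[String] of raw Kafka values, and Spark infers the schema from the JSON):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, max, min}
import org.apache.spark.streaming.Seconds

// Each micro-batch over the 300-second window becomes a DataFrame,
// then one groupBy computes all three stats per id.
jsonStream.window(Seconds(300)).foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  val df = sqlContext.read.json(rdd)  // infers id, ht, rotor_rpm, temp, time
  df.groupBy("id")
    .agg(avg("ht"), min("ht"), max("ht"),
         avg("rotor_rpm"), min("rotor_rpm"), max("rotor_rpm"),
         avg("temp"), min("temp"), max("temp"))
    .show()
}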

Jacek
On 14 Jun 2016 10:29 p.m., "Sivakumaran S" <siva.kuma...@me.com> wrote:

Dear friends,

I have set up Kafka 0.9.0.0, Spark 1.6.1 and Scala 2.10. My source
periodically sends a JSON string to a Kafka topic. I am able to consume this
topic using Spark Streaming and print it. The schema of the source JSON is
as follows:

{ "id": 121156, "ht": 42, "rotor_rpm": 180, "temp": 14.2, "time": 146593512 }
{ "id": 121157, "ht": 18, "rotor_rpm": 110, "temp": 12.2, "time": 146593512 }
{ "id": 121156, "ht": 36, "rotor_rpm": 160, "temp": 14.4, "time": 146593513 }
{ "id": 121157, "ht": 19, "rotor_rpm": 120, "temp": 12.0, "time": 146593513 }
and so on.


In Spark Streaming, I want to find the *average* of "ht" (height),
"rotor_rpm" and "temp" for each "id". I also want to find the max and min
of the same fields in the time window (300 seconds in this case).

Q1. Can this be done using plain RDDs and streaming functions, or does it
require DataFrames/SQL? More fields may be added to the JSON later, and
there will eventually be a lot of "id"s.

Q2. If it can be done with either, which one would be more efficient and
faster?

As of now, the entire setup runs on a single laptop.

Thanks in advance.

Regards,

Siva
