Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-26 Thread ??????????
Hi Kodali,


I am puzzled by the statement
"Kafka Streaming can indeed do map, reduce, join and window operations".


Do you mean that Kafka has an API such as map, or that Kafka has no such API
but can still do these operations some other way?
As far as I remember, Kafka does not have an API like map and so on.




 

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-15 Thread Michael Armbrust
Continuous processing is still a work in progress.  I would really like to
at least have a basic version in Spark 2.3.

The announcement about 2.2 is that we are planning to remove the
experimental tag from Structured Streaming.

On Thu, Jun 15, 2017 at 11:53 AM, kant kodali wrote:

> Wow! You caught the 007! Is continuous processing mode available in 2.2?
> The ticket says the target version is 2.3, but the talk in the video says
> 2.2 and beyond, so I am just curious whether it is available in 2.2 or
> whether I should try it from the latest build.
>
> Thanks!


Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-15 Thread kant kodali
Wow! You caught the 007! Is continuous processing mode available in 2.2?
The ticket says the target version is 2.3, but the talk in the video says
2.2 and beyond, so I am just curious whether it is available in 2.2 or
whether I should try it from the latest build.

Thanks!



Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-14 Thread Michael Armbrust
This is a good question. I really like using Kafka as a centralized source for
streaming data in an organization and, with Spark 2.2, we have full support
for reading and writing data to/from Kafka in both streaming and batch.
I'll focus here on what I think the advantages are of Structured Streaming
over Kafka Streams (a stream processing library that reads from Kafka).

 - *High level productive APIs* - Streaming queries in Spark can be
expressed using DataFrames, Datasets or even plain SQL.  Streaming
DataFrames/SQL are supported in Scala, Java, Python and even R.  This means
that for common operations like filtering, joining, aggregating, you can
use built-in operations.  For complicated custom logic you can use UDFs and
lambda functions. In contrast, Kafka Streams mostly requires you to express
your transformations using lambda functions.
 - *High Performance* - Since it is built on Spark SQL, streaming queries
take advantage of the Catalyst optimizer and the Tungsten execution engine.
This design leads to huge performance wins, which means you need less
hardware to accomplish the same job.
 - *Ecosystem* - Spark has connectors for working with all kinds of data
stored in a variety of systems.  This means you can join a stream with data
encoded in parquet and stored in S3/HDFS.  Perhaps more importantly, it
also means that if you decide that you don't want to manage a Kafka cluster
anymore and would rather use Kinesis, you can do that too.  We recently
moved a bunch of our pipelines from Kafka to Kinesis and had to change only
a few lines of code! I think it's likely that in the future Spark will also
have connectors for Google's PubSub and Azure's streaming offerings.
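
The contrast between built-in operations and lambda functions can be made concrete with a toy example. The sketch below is plain Python, not Spark or Catalyst code; `WithColumn`, `FilterEq`, `FilterLambda`, `optimize` and `run` are hypothetical names made up for this illustration. It shows only the general idea: when a filter is declared as data (a column name and a value), a planner can see which column it reads and move it ahead of an expensive step, whereas an opaque lambda has to stay where the user put it.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class WithColumn:
    """Add column `name`, computed from the row by `fn`."""
    name: str
    fn: Callable[[dict], Any]


@dataclass(frozen=True)
class FilterEq:
    """Declarative filter: keep rows where row[column] == value."""
    column: str
    value: Any


@dataclass(frozen=True)
class FilterLambda:
    """Opaque filter: the planner cannot see which columns `fn` reads."""
    fn: Callable[[dict], bool]


def optimize(plan):
    """Toy filter pushdown: move a FilterEq in front of the WithColumn that
    precedes it, provided the filter does not read the column that step adds."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, step = plan[i - 1], plan[i]
            if (isinstance(step, FilterEq) and isinstance(prev, WithColumn)
                    and step.column != prev.name):
                plan[i - 1], plan[i] = step, prev
                changed = True
    return plan


def run(plan, rows):
    """Execute the steps in order over a list of dict rows."""
    rows = [dict(r) for r in rows]
    for step in plan:
        if isinstance(step, WithColumn):
            rows = [{**r, step.name: step.fn(r)} for r in rows]
        elif isinstance(step, FilterEq):
            rows = [r for r in rows if r[step.column] == step.value]
        else:  # FilterLambda
            rows = [r for r in rows if step.fn(r)]
    return rows


calls = []  # records every invocation of the expensive function


def expensive_score(row):
    calls.append(row["id"])
    return row["id"] * 10


rows = [{"id": 1, "country": "FR"}, {"id": 2, "country": "US"},
        {"id": 3, "country": "FR"}]

declarative = [WithColumn("score", expensive_score), FilterEq("country", "FR")]
opaque = [WithColumn("score", expensive_score),
          FilterLambda(lambda r: r["country"] == "FR")]
```

Running `run(optimize(declarative), rows)` calls `expensive_score` only for the two rows that survive the filter, while `run(optimize(opaque), rows)` returns the same rows but calls it for all three. The reordering is only safe here because the toy `fn` has no side effects that the result depends on; a real optimizer has to reason about far more than this.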

Regarding latency, there has been a lot of discussion about the inherent
latencies of micro-batch.  Fortunately, we were very careful to leave
batching out of the user-facing API, and as we demo'ed last week, this
makes it possible for Spark Streaming to achieve sub-millisecond
latencies.  Watch SPARK-20928 for more on this effort to eliminate
micro-batch from Spark's execution model.

At the far other end of the latency spectrum...  For those with jobs that
run in the cloud on data that arrives sporadically, you can run streaming
jobs that only execute every few hours or every few days, shutting the
cluster down in between.  This architecture can result in huge cost
savings for some applications.
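
Running a streaming job only every few hours works if each run remembers how far the previous one got. The following is a minimal sketch of that idea in plain Python, not Spark's API: `run_once`, the JSON checkpoint file and the list standing in for a durable source are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_once(log, checkpoint_file, process):
    """Process every record appended since the previous run, then save the
    new position so the next run starts where this one stopped."""
    path = Path(checkpoint_file)
    start = json.loads(path.read_text())["offset"] if path.exists() else 0
    pending = log[start:]
    for record in pending:
        process(record)
    path.write_text(json.dumps({"offset": start + len(pending)}))
    return len(pending)


# Hypothetical usage: the list stands in for a durable source of events.
log = ["e1", "e2", "e3"]
seen = []
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "offset.json"
    first = run_once(log, checkpoint, seen.append)   # processes e1, e2, e3
    log += ["e4", "e5"]                              # data arrives while the job is down
    second = run_once(log, checkpoint, seen.append)  # processes only e4, e5
    third = run_once(log, checkpoint, seen.append)   # nothing new to process
```

Because the offset is written only after the records have been processed, a run that fails part-way would reprocess its records next time. That is an at-least-once behaviour of this sketch and says nothing about the guarantees of any particular engine.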

Michael

On Sun, Jun 11, 2017 at 1:12 AM, kant kodali wrote:

> Hi All,
>
> I am trying hard to figure out what the real difference is between Kafka
> Streaming and Spark Streaming, other than saying that one can be used as
> part of microservices (since Kafka Streaming is just a library) and the
> other is a standalone framework by itself.
>
> If I can accomplish the same job either way, this is a puzzling question
> for me, so it would be great to know what Spark Streaming can do that
> Kafka Streaming cannot do efficiently.
>
> Thanks!
>
>


Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-13 Thread Paolo Patierno
I think a big advantage of not using Spark Streaming when your solution is 
already based on Kafka is that you don't have to deal with another cluster.
Imagine that your solution already uses Kafka as the ingestion system for 
your events, and you then need to do some real-time analysis on the streams. 
Adding Spark means adding a new cluster with a master and one or more nodes, 
across which Spark then distributes jobs for you. Using the lightweight streams 
library from Kafka means just developing a new application that reads events 
from the same cluster. You can deploy more instances of the same application 
for load balancing, and Kafka itself takes care of the rest.
In terms of deployment, I think this is a big advantage of using Kafka 
Streams with the existing Kafka cluster instead of adding Spark.

Paolo


RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-12 Thread Mohammed Guller
Regarding Spark scheduler – if you are referring to the ability to distribute 
workload and scale, Kafka Streaming also provides that capability. It is 
deceptively simple in that regard if you already have a Kafka cluster. You can 
launch multiple instances of your Kafka streaming application and Kafka 
streaming will automatically balance the workload across different instances. 
It rebalances workload as you add or remove instances. Similarly, if an 
instance fails or crashes, it will automatically detect that.
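
The rebalancing described above can be pictured with a small sketch. This is plain Python and deliberately simplistic: `assign_partitions` is a hypothetical helper that spreads partitions round-robin over whichever instances are alive, and the instance names are made up. Kafka's actual group coordination and assignment strategies are more involved; the sketch shows only the effect of adding or losing an instance.

```python
def assign_partitions(partitions, instances):
    """Spread partitions round-robin across the currently running instances."""
    if not instances:
        raise ValueError("at least one running instance is required")
    ordered = sorted(instances)
    assignment = {name: [] for name in ordered}
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment


partitions = range(6)

# Two instances share six partitions.
two = assign_partitions(partitions, {"app-1", "app-2"})

# A third instance is started: the work is spread over three.
three = assign_partitions(partitions, {"app-1", "app-2", "app-3"})

# app-2 fails: its partitions are taken over by the survivors.
after_failure = assign_partitions(partitions, {"app-1", "app-3"})
```

With two instances each one owns three partitions, with three instances each owns two, and after `app-2` disappears the remaining two instances own three each again.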

Regarding real-time – rather than debating which one is real-time, I would look 
at the latency requirements of my application. For most applications, the 
near-real-time capabilities of Spark Streaming might be good enough. For 
others, they may not be.  For example, if I were building a high-frequency 
trading application, where I want to process individual trades as soon as they 
happen, I might lean towards Kafka streaming.
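
To put rough numbers on the latency question, here is a small model in plain Python. It assumes, for illustration only, that a micro-batch engine processes an event at the first batch boundary at or after its arrival and that processing itself takes no time; the function names and the arrival times are invented. Under the same model, record-at-a-time processing has zero waiting time.

```python
def micro_batch_wait_ms(arrivals_ms, interval_ms):
    """Time each event waits for the next batch boundary (integer milliseconds)."""
    return [-(-t // interval_ms) * interval_ms - t for t in arrivals_ms]


def batches_used(arrivals_ms, interval_ms):
    """Number of distinct batches needed to cover the given arrivals."""
    return len({-(-t // interval_ms) for t in arrivals_ms})


arrivals = [100, 450, 900, 1200, 1950]           # arrival times in ms

slow = micro_batch_wait_ms(arrivals, 1000)       # 1-second batches
fast = micro_batch_wait_ms(arrivals, 100)        # 100 ms batches

mean_slow = sum(slow) / len(slow)
mean_fast = sum(fast) / len(fast)
```

In this model, one-second batches make the five events wait 480 ms on average but need only two batch executions, while 100 ms batches cut the average wait to 20 ms at the cost of five executions. That is the latency versus throughput trade-off in miniature; real systems add scheduling and processing overheads that the model ignores.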

Agree about the benefits of using SQL with structured streaming.

Mohammed


Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread kant kodali
Another difference I see is something like Spark SQL, where there are
logical plans, physical plans, code generation and all those optimizations.
I don't see them in Kafka Streaming at this time.


Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread kant kodali
I appreciate the responses; however, I see the other side of the argument, and
I actually feel they are now competitors in the streaming space in some sense.

Kafka Streaming can indeed do map, reduce, join and window operations.
Likewise, data can be ingested into Kafka from many sources and the results
sent out to many sinks. Look up "Kafka Connect".
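
As an illustration of what a window operation means in an event-at-a-time model, here is a minimal sketch in plain Python. It does not use the Kafka Streams or Spark API; `tumbling_window_counts` and the sample events are hypothetical. Each event updates the running count for its key in the fixed-size (tumbling) window that contains its timestamp.

```python
from collections import Counter


def tumbling_window_counts(events, window_ms):
    """Count events per (window start, key), updating state one event at a time."""
    counts = Counter()
    for timestamp_ms, key in events:
        window_start = timestamp_ms - timestamp_ms % window_ms
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(100, "a"), (900, "b"), (1500, "a"), (1999, "a"), (2000, "b")]
per_second = tumbling_window_counts(events, 1000)
```

With one-second windows, key `a` is counted once in the window starting at 0 and twice in the window starting at 1000, and key `b` once each in the windows starting at 0 and 2000.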

Regarding event-at-a-time vs. micro-batch: I hear one group of people saying
that Spark Streaming is real time, and another group saying that Kafka
Streaming is the true real time. So do we say micro-batch is real time, or
event-at-a-time is real time?

It is a well-known fact that Spark is more popular with data scientists who
want to run ML algorithms and so on, but I also hear that people can use the
H2O package along with Kafka Streaming. How efficient each of these approaches
is, I have no clue.

The major difference I see is actually the *Spark Scheduler*. I don't think
Kafka Streaming has anything like this; instead, it just allows you to run
lambda expressions on a stream and write the results out to a specific
topic/partition, and from there one can use Kafka Connect to write them out to
any sink. In short, all the optimizations built into the Spark scheduler don't
seem to exist in Kafka Streaming, so if I were to decide which framework to
use, an additional question I would think about is: "Do I want my stream to go
through the scheduler and, if so, why or why not?"

Above all, please correct me if I am wrong :)




On Sun, Jun 11, 2017 at 12:41 PM, Mohammed Guller <moham...@glassbeam.com>
wrote:

> Just to elaborate on what Vincent wrote – Kafka streaming provides true
> record-at-a-time processing capabilities whereas Spark Streaming provides
> micro-batching capabilities on top of Spark. Depending on your use case,
> you may find one better than the other. Both provide stateless and stateful
> stream processing capabilities.
>
>
>
> A few more things to consider:
>
>    1. If you don’t already have a Spark cluster, but have a Kafka cluster,
>       it may be easier to use Kafka streaming since you don’t need to set up
>       and manage another cluster.
>    2. On the other hand, if you already have a Spark cluster, but don’t
>       have a Kafka cluster (in case you are using some other messaging
>       system), Spark Streaming is a better option.
>    3. If you already know and use Spark, you may find it easier to
>       program with the Spark Streaming API even if you are using Kafka.
>    4. Spark Streaming may give you better throughput. So you have to
>       decide what is more important for your stream processing application –
>       latency or throughput?
>    5. Kafka streaming is relatively new and less mature than Spark
>       Streaming.
>
>
>
> Mohammed
>
>
>
> *From:* vincent gromakowski [mailto:vincent.gromakow...@gmail.com]
> *Sent:* Sunday, June 11, 2017 12:09 PM
> *To:* yohann jardin <yohannjar...@hotmail.com>
> *Cc:* kant kodali <kanth...@gmail.com>; vaquar khan <vaquar.k...@gmail.com>;
> user <user@spark.apache.org>
> *Subject:* Re: What is the real difference between Kafka streaming and
> Spark Streaming?
>
>
>
> I think Kafka Streams is good when the processing of each row is
> independent of the others (row parsing, data cleaning...).
>
> Spark is better when processing groups of rows (group by, ML, window
> functions...).
>
>
>
> On 11 June 2017 at 8:15 PM, "yohann jardin" <yohannjar...@hotmail.com>
> wrote:
>
> Hey,
>
> Kafka can also do streaming on its own:
> https://kafka.apache.org/documentation/streams
> Unfortunately, I don’t know much about it. I can only repeat what I heard
> at conferences: one should give Kafka streaming a try when the whole
> pipeline uses Kafka. I have no pros/cons to argue on this topic.
>
> *Yohann Jardin*
>
> On 6/11/2017 at 7:08 PM, vaquar khan wrote:
>
> Hi Kant,
>
> Kafka is a message broker used by producers and consumers, and Spark
> Streaming is used for real-time processing. Kafka and Spark Streaming
> work together; they are not competitors.
>
> Spark Streaming reads data from Kafka and processes it in micro-batches.
> In simple terms, it collects data for some time, builds an RDD, and then
> processes these micro-batches.
>
>
>
>
>
> Please read the docs:
> https://spark.apache.org/docs/latest/streaming-programming-guide.html
>
>
>
> Spark Streaming is an extension of the core Spark API that enables
> scalable, high-throughput, fault-tolerant stream processing of live data
> streams. Data can be ingested from many sources like *Kafka, Flume,
> Kinesis, or TCP sockets*, and can be processed using complex algorithms
> expressed with high-level fu

RE: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread Mohammed Guller
Just to elaborate on what Vincent wrote: Kafka streaming provides true 
record-at-a-time processing capabilities, whereas Spark Streaming provides 
micro-batching capabilities on top of Spark. Depending on your use case, you 
may find one better than the other. Both provide stateless and stateful stream 
processing capabilities.
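
To make the record-at-a-time vs. micro-batch distinction a bit more concrete, here is a minimal plain-Python sketch. It does not use the Kafka or Spark APIs at all; the function names and the idea of measuring "added wait" per record are assumptions made purely for illustration, and real latency in either system depends on much more than this.

```python
import math

def record_at_a_time_latency(arrival_times):
    """Each record is handled as soon as it arrives, so the added wait is zero."""
    return [0.0 for _ in arrival_times]

def micro_batch_latency(arrival_times, batch_interval):
    """Each record waits until the end of the batch interval it falls into."""
    waits = []
    for t in arrival_times:
        # End of the interval [k * batch_interval, (k + 1) * batch_interval) containing t.
        batch_end = (math.floor(t / batch_interval) + 1) * batch_interval
        waits.append(batch_end - t)
    return waits

# Hypothetical arrival times in seconds.
arrivals = [0.5, 1.0, 2.5, 3.0, 4.5]
print(record_at_a_time_latency(arrivals))                 # [0.0, 0.0, 0.0, 0.0, 0.0]
print(micro_batch_latency(arrivals, batch_interval=2.0))  # [1.5, 1.0, 1.5, 1.0, 1.5]
```

In this toy model a record can wait up to one full batch interval before it is processed, which is one way to picture why batching is usually discussed as a latency vs. throughput trade-off.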

A few more things to consider:

  1.  If you don’t already have a Spark cluster, but have a Kafka cluster, it 
      may be easier to use Kafka streaming since you don’t need to set up and 
      manage another cluster.
  2.  On the other hand, if you already have a Spark cluster, but don’t have a 
      Kafka cluster (in case you are using some other messaging system), Spark 
      streaming is a better option.
  3.  If you already know and use Spark, you may find it easier to program with 
      the Spark Streaming API even if you are using Kafka.
  4.  Spark Streaming may give you better throughput. So you have to decide 
      what is more important for your stream processing application: latency 
      or throughput?
  5.  Kafka streaming is relatively new and less mature than Spark Streaming.

Mohammed

From: vincent gromakowski [mailto:vincent.gromakow...@gmail.com]
Sent: Sunday, June 11, 2017 12:09 PM
To: yohann jardin <yohannjar...@hotmail.com>
Cc: kant kodali <kanth...@gmail.com>; vaquar khan <vaquar.k...@gmail.com>; user 
<user@spark.apache.org>
Subject: Re: What is the real difference between Kafka streaming and Spark 
Streaming?

I think Kafka Streams is good when the processing of each row is independent 
of the others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window 
functions...).

On 11 June 2017 at 8:15 PM, "yohann jardin" <yohannjar...@hotmail.com> wrote:

Hey,
Kafka can also do streaming on its own: 
https://kafka.apache.org/documentation/streams
Unfortunately, I don’t know much about it. I can only repeat what I heard at 
conferences: one should give Kafka streaming a try when the whole pipeline 
uses Kafka. I have no pros/cons to argue on this topic.

Yohann Jardin
On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is a message broker used by producers and consumers, and Spark 
Streaming is used for real-time processing. Kafka and Spark Streaming work 
together; they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches. In 
simple terms, it collects data for some time, builds an RDD, and then 
processes these micro-batches.


Please read doc : 
https://spark.apache.org/docs/latest/streaming-programming-guide.html


Spark Streaming is an extension of the core Spark API that enables scalable, 
high-throughput, fault-tolerant stream processing of live data streams. Data 
can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, 
and can be processed using complex algorithms expressed with high-level 
functions like map, reduce, join and window. Finally, processed data can be 
pushed out to filesystems, databases, and live dashboards. In fact, you can 
apply Spark’s machine learning 
(https://spark.apache.org/docs/latest/ml-guide.html) and graph processing 
(https://spark.apache.org/docs/latest/graphx-programming-guide.html) 
algorithms on data streams.


Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <kanth...@gmail.com> wrote:
Hi All,

I am trying hard to figure out the real difference between Kafka Streaming 
and Spark Streaming, other than that one can be used as part of microservices 
(since Kafka Streaming is just a library) and the other is a standalone 
framework by itself.

If I can accomplish the same job either way, this is a puzzling question for 
me, so it would be great to know what Spark Streaming can do that Kafka 
Streaming cannot do, or cannot do efficiently.

Thanks!




--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago




Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vincent gromakowski
I think Kafka Streams is good when the processing of each row is
independent of the others (row parsing, data cleaning...).
Spark is better when processing groups of rows (group by, ML, window
functions...).
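
A small plain-Python sketch of the two kinds of work being contrasted here, per-row processing versus processing over a group of rows. It uses neither Kafka Streams nor Spark and makes no claim about what either library can or cannot do; the function names and sample records are hypothetical.

```python
from collections import defaultdict

def clean_row(row):
    """Stateless: the output depends only on this one row."""
    return {"user": row["user"].strip().lower(), "amount": float(row["amount"])}

def total_per_user(rows):
    """Stateful: the total for a user depends on every row seen for that user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)

# Hypothetical raw input records.
raw = [
    {"user": " Alice ", "amount": "10"},
    {"user": "bob", "amount": "2.5"},
    {"user": "ALICE", "amount": "5"},
]

cleaned = [clean_row(r) for r in raw]  # each row handled independently
print(total_per_user(cleaned))         # {'alice': 15.0, 'bob': 2.5}
```

The first function can be applied to records in any order and on any machine; the second needs all rows for a key to end up in the same place, which is where grouping, shuffling and state management come in.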

On 11 June 2017 at 8:15 PM, "yohann jardin" wrote:

Hey,
Kafka can also do streaming on its own:
https://kafka.apache.org/documentation/streams
Unfortunately, I don’t know much about it. I can only repeat what I heard
at conferences: one should give Kafka streaming a try when the whole
pipeline uses Kafka. I have no pros/cons to argue on this topic.

*Yohann Jardin*
On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is a message broker used by producers and consumers, and Spark
Streaming is used for real-time processing. Kafka and Spark Streaming
work together; they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches.
In simple terms, it collects data for some time, builds an RDD, and then
processes these micro-batches.


Please read the docs:
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live data
streams. Data can be ingested from many sources like *Kafka, Flume,
Kinesis, or TCP sockets*, and can be processed using complex algorithms
expressed with high-level functions like map, reduce, join and window.
Finally, processed data can be pushed out to filesystems, databases, and
live dashboards. In fact, you can apply Spark’s machine learning and graph
processing algorithms on data streams.

Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali  wrote:

> Hi All,
>
> I am trying hard to figure out the real difference between Kafka
> Streaming and Spark Streaming, other than that one can be used as part of
> microservices (since Kafka Streaming is just a library) and the other is a
> standalone framework by itself.
>
> If I can accomplish the same job either way, this is a puzzling question
> for me, so it would be great to know what Spark Streaming can do that
> Kafka Streaming cannot do, or cannot do efficiently.
>
> Thanks!
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread yohann jardin
Hey,

Kafka can also do streaming on its own: 
https://kafka.apache.org/documentation/streams
Unfortunately, I don’t know much about it. I can only repeat what I heard at 
conferences: one should give Kafka streaming a try when the whole pipeline 
uses Kafka. I have no pros/cons to argue on this topic.

Yohann Jardin

On 6/11/2017 at 7:08 PM, vaquar khan wrote:

Hi Kant,

Kafka is a message broker used by producers and consumers, and Spark 
Streaming is used for real-time processing. Kafka and Spark Streaming work 
together; they are not competitors.

Spark Streaming reads data from Kafka and processes it in micro-batches. In 
simple terms, it collects data for some time, builds an RDD, and then 
processes these micro-batches.


Please read doc : 
https://spark.apache.org/docs/latest/streaming-programming-guide.html


Spark Streaming is an extension of the core Spark API that enables scalable, 
high-throughput, fault-tolerant stream processing of live data streams. Data 
can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, 
and can be processed using complex algorithms expressed with high-level 
functions like map, reduce, join and window. Finally, processed data can be 
pushed out to filesystems, databases, and live dashboards. In fact, you can 
apply Spark’s machine learning and graph processing algorithms on data 
streams.


Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali wrote:
Hi All,

I am trying hard to figure out the real difference between Kafka Streaming 
and Spark Streaming, other than that one can be used as part of microservices 
(since Kafka Streaming is just a library) and the other is a standalone 
framework by itself.

If I can accomplish the same job either way, this is a puzzling question for 
me, so it would be great to know what Spark Streaming can do that Kafka 
Streaming cannot do, or cannot do efficiently.

Thanks!




--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago



Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vaquar khan
Hi Kant,

Kafka is a message broker used by producers and consumers, and Spark
Streaming is used for real-time processing. Kafka and Spark Streaming
work together; they are not competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches.
In simple terms, it collects data for some time, builds an RDD, and then
processes these micro-batches.
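
The "collect data for some time, then process the batch" idea can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: it does not use Spark, the function names are hypothetical, a plain list stands in for an RDD, and intervals with no events are simply skipped.

```python
from collections import defaultdict
from functools import reduce

def to_micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches, one per time interval."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Index of the interval this event falls into.
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[i] for i in sorted(batches)]

def process_batch(batch):
    """Apply a map step (square) and a reduce step (sum) to one batch."""
    squared = map(lambda v: v * v, batch)
    return reduce(lambda a, b: a + b, squared, 0)

# Hypothetical (timestamp in seconds, value) events.
events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.7, 4)]
batches = to_micro_batches(events, batch_interval=1.0)
print(batches)                               # [[1, 2], [3], [4]]
print([process_batch(b) for b in batches])   # [5, 9, 16]
```

Each batch is processed as a unit once its interval has closed, so results arrive once per interval rather than once per event.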


Please read doc :
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live data
streams. Data can be ingested from many sources like *Kafka, Flume,
Kinesis, or TCP sockets*, and can be processed using complex algorithms
expressed with high-level functions like map, reduce, join and window.
Finally, processed data can be pushed out to filesystems, databases, and
live dashboards. In fact, you can apply Spark’s machine learning and graph
processing algorithms on data streams.

Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali  wrote:

> Hi All,
>
> I am trying hard to figure out the real difference between Kafka
> Streaming and Spark Streaming, other than that one can be used as part of
> microservices (since Kafka Streaming is just a library) and the other is a
> standalone framework by itself.
>
> If I can accomplish the same job either way, this is a puzzling question
> for me, so it would be great to know what Spark Streaming can do that
> Kafka Streaming cannot do, or cannot do efficiently.
>
> Thanks!
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago