Hi Felipe

From what I remember, Spark still uses micro-batches to shuffle data in Structured Streaming.
Flink, in contrast, processes elements record by record; there is no disk-I/O shuffle in Flink streaming. Each record is emitted to the downstream operator by selecting a specific channel over the network [1]. That is why we need to call "keyBy" before using windows: the "KeyGroupStreamPartitioner" is then used to select the target channel based on the key-group index. Data is first stored in the local state backend and waits to be polled out once a window is triggered, rather than being "shuffled" at the moment the window triggers.


[1] 
https://github.com/apache/flink/blob/2c2095bdad3d47f27973a585112ed820f457de6f/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/api/writer/ChannelSelectorRecordWriter.java#L60

Best
Yun Tang

________________________________
From: Felipe Gutierrez <felipe.o.gutier...@gmail.com>
Sent: Friday, October 11, 2019 15:47
To: Yun Tang <myas...@live.com>
Cc: user <user@flink.apache.org>
Subject: Re: Difference between windows in Spark and Flink

Hi Yun,

that is a very complete answer. Thanks!

I was also wondering about the micro-batches that Spark creates when we have to create a Spark Streaming context. They still remain in all versions of stream processing in Spark, don't they? And because of that, Spark shuffles data [1] to operators with wide dependencies every time a micro-batch ends, doesn't it?
Flink, on the other hand, does not have micro-batches, so it will shuffle data to operators with wide dependencies only when a window is triggered. Am I correct?

[1] 
http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/rdd-programming-guide.html#shuffle-operations

Thanks,
Felipe
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez
-- https://felipeogutierrez.blogspot.com


On Thu, Oct 10, 2019 at 7:25 PM Yun Tang <myas...@live.com> wrote:
Hi Felipe

Generally speaking, the key difference that impacts performance is where they store the data belonging to windows.
Flink stores the window data and its related timestamps in the state backend [1]. Once a window is triggered, it pulls all the stored timers together with their record keys, and then uses those keys to query the state backend for the next actions.
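
To illustrate [1] with a minimal sketch (assuming the current 1.9-era API; the checkpoint path is only a placeholder): the same job can be pointed at a different state backend without touching the window logic, and the window contents and timers then live wherever that backend keeps keyed state.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Window contents and timers are kept as keyed state in the configured backend,
// here RocksDB on local disk, with checkpoints written to the given URI (placeholder path).
env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints"))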

For Spark, first of all, we should talk about Structured Streaming [2], since it handles window scenarios better than the older Spark Streaming. Unlike Flink, which ships with a built-in RocksDB state backend, Spark has only one implementation of a state store provider: HDFSBackedStateStoreProvider, which keeps all of the data in memory. This is a very memory-consuming approach and can run into OOM errors [3][4][5].
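
For reference, a windowed aggregation in Structured Streaming looks roughly like the sketch below (the socket source, column names and durations are placeholders on my side); its per-key, per-window state goes to the default HDFSBackedStateStoreProvider, i.e. executor memory, which is where the OOM risk comes from.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder.appName("windowed-counts-sketch").getOrCreate()

// Streaming DataFrame with a "word" and a "timestamp" column (placeholder source).
val events = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
  .selectExpr("CAST(value AS STRING) AS word", "current_timestamp() AS timestamp")

val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
  .count()   // state for each key/window pair lives in the in-memory state store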

To avoid this, Databricks Runtime offers a 'RocksDBStateStoreProvider', but it is not open source [6]. We are lucky that open-source Flink already offers a built-in RocksDB state backend to avoid the OOM problem. Moreover, the Flink community is currently developing a spillable heap state backend [7].
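
If I remember correctly, the provider can be swapped through the spark.sql.streaming.stateStore.providerClass setting, so a third-party RocksDB-backed provider such as the one in [5] could be plugged in roughly like this (the fully qualified class name below is a placeholder; take the real one from that project):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("custom-state-store-sketch")
  // Placeholder class name; use the provider class published by the library in [5].
  .config("spark.sql.streaming.stateStore.providerClass",
          "com.example.state.RocksDbStateStoreProvider")
  .getOrCreate()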

[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/state_backends.html
[2] 
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
[3] 
https://medium.com/@chandanbaranwal/state-management-in-spark-structured-streaming-aaa87b6c9d31
[4] 
http://apache-spark-user-list.1001560.n3.nabble.com/use-rocksdb-for-spark-structured-streaming-SSS-td34776.html#a34779
[5] https://github.com/chermenin/spark-states
[6] 
https://docs.databricks.com/spark/latest/structured-streaming/production.html#optimize-performance-of-stateful-streaming-queries
[7] https://issues.apache.org/jira/browse/FLINK-12692

Best
Yun Tang



________________________________
From: Felipe Gutierrez <felipe.o.gutier...@gmail.com>
Sent: Thursday, October 10, 2019 20:39
To: user <user@flink.apache.org>
Subject: Difference between windows in Spark and Flink

Hi all,

I am trying to think about the essential differences between operators in Flink and Spark, especially when I use keyed windows followed by a reduce operation.
In Flink we can develop an application that logically separates these two operators: after a keyed window I can use the .reduce()/aggregate()/fold()/apply() functions [1].
In Spark we have the window/reduceByKeyAndWindow functions, which to me appear less flexible in the options they offer for a keyed window operation [2].
Moreover, when these two applications are deployed on a Flink and a Spark cluster respectively, what are the differences between their physical operators running in the cluster?
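
To make the comparison concrete, here is the rough shape of the two variants I have in mind (Scala in both cases; sources, ports and durations are placeholders, so please read it as a sketch rather than a tested job):

// Flink: keyBy() and the window are separate operators, and after the window
// I can choose between reduce/aggregate/fold/apply [1].
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val flinkCounts = env.socketTextStream("localhost", 9999)   // placeholder source, one word per line
  .map(word => (word, 1))
  .keyBy(_._1)
  .timeWindow(Time.minutes(10))
  .reduce((a, b) => (a._1, a._2 + b._2))

// Spark DStream API: keying, windowing and reducing are folded into reduceByKeyAndWindow [2].
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("sketch"), Seconds(1))
val sparkCounts = ssc.socketTextStream("localhost", 9998)   // placeholder source, one word per line
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Minutes(5))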

[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#windows
[2] 
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

Thanks,
Felipe
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez
-- https://felipeogutierrez.blogspot.com
