Hi Kant,

Jumping in here, would love corrections if I'm wrong about any of this.

The short answer is no: HDFS is not necessary to run stateful stream
processing. In the minimal case, you can use the MemoryStateBackend, which
keeps working state on the TaskManager heap and backs up checkpoints to the
JobManager's memory.
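
Here's a minimal sketch of what that looks like, assuming a plain Java
DataStream job (the trivial pipeline, job name, and interval are placeholders
I'm making up for illustration, not anything from this thread):

    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class MemoryBackendExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // State lives on the TaskManager heap; checkpoints are shipped to
            // the JobManager's memory. Fine for local testing, not production.
            env.setStateBackend(new MemoryStateBackend());
            env.enableCheckpointing(10_000); // checkpoint every 10s (example value)

            env.fromElements("a", "b", "c").print(); // stand-in for a real pipeline
            env.execute("memory-backend-example");
        }
    }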

In any production scenario, you will want more durability for your
checkpoints and larger state size. To do this, you should use either
RocksDBStateBackend or FsStateBackend. Assuming you want one of these, you
will need a checkpoint directory on a filesystem that is accessible by all
TaskManagers. The filesystem for this checkpoint directory
(state.backend.*.checkpointdir) can be a shared drive or anything supported
by Hadoop's filesystem abstraction; see
https://hadoop.apache.org/docs/stable/index.html
under "Hadoop Compatible File Systems" for other alternatives (S3, for
example).
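
For example, pointing the FsStateBackend at a durable URI looks roughly like
this (sketch only; the namenode host, port, bucket, and paths below are
placeholders, so substitute whatever your cluster actually uses):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FsBackendExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Any Hadoop-compatible URI works here, e.g.:
            //   HDFS: "hdfs://namenode:8020/flink/checkpoints"
            //   S3:   "s3://my-bucket/flink/checkpoints"
            env.setStateBackend(
                    new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));
            env.enableCheckpointing(60_000); // checkpoint every minute (example value)

            env.fromElements("a", "b", "c").print(); // stand-in for a real pipeline
            env.execute("fs-backend-example");
        }
    }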

Choosing RocksDBStateBackend vs. FsStateBackend is a different decision.
FsStateBackend stores in-flight state in memory and writes it to your
durable filesystem only when checkpoints are initiated. The
RocksDBStateBackend stores in-flight data on local disk (in RocksDB)
instead of in memory. When checkpoints are initiated, the appropriate state
is then written to the durable filesystem. Because it stores state on disk,
RocksDBStateBackend can handle much larger state than FsStateBackend on
equivalent hardware.
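
Switching to RocksDB is just a one-line change in the job, plus having the
flink-statebackend-rocksdb dependency on the classpath. Again a sketch with a
placeholder checkpoint URI:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDbBackendExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // In-flight state lives in local RocksDB instances on each
            // TaskManager; completed checkpoints go to the durable URI below.
            env.setStateBackend(
                    new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints"));
            env.enableCheckpointing(60_000);

            env.fromElements("a", "b", "c").print(); // stand-in for a real pipeline
            env.execute("rocksdb-backend-example");
        }
    }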

I'm drawing most of this from this page:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html

Does that make sense?

Cheers,
Wolfe

~
Brian Wolfe


On Fri, Apr 7, 2017 at 2:32 AM, kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I read the docs, however I still have the following question: for stateful
> stream processing, is HDFS mandatory? In some places I see it is required
> and in other places I see that RocksDB can be used. I just want to know if
> HDFS is mandatory for stateful stream processing?
>
> Thanks!
>