Till and Stephan, thanks for your clarification.
@Till: One more question. From what I have read about checkpointing [1],
list operations don't seem likely to be performed frequently, so storing
state backend data on S3 shouldn't have any severe impact on Flink
performance. Is this assumption right?
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.2/internals/stream_checkpointing.html
-- Ayush
On Fri, May 12, 2017 at 1:05 AM Stephan Ewen <se...@apache.org> wrote:
> Small addition to Till's comment:
>
> In the case where file:// points to a mounted distributed file system
> (NFS, MapRFs, ...), then it actually works. The important thing is that the
> filesystem where the checkpoints go is replicated (fault tolerant) and
> accessible from all nodes.
>
> On Thu, May 11, 2017 at 2:16 PM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Ayush,
>>
>> you’re right that RocksDB is the recommended state backend because of the
>> above-mentioned reasons. To make recovery work properly, you have to
>> configure a shared directory for the checkpoint data via
>> state.backend.fs.checkpointdir. You can basically configure any file
>> system which is supported by Hadoop (no HDFS required). The reason is that
>> we use Hadoop to bridge between different file systems. The only thing you
>> have to make sure is that you have the respective file system
>> implementation in your class path.
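>>
>> As a minimal sketch (the bucket name and path are placeholders, and this
>> assumes the S3 file system implementation is on the class path), the
>> relevant flink-conf.yaml entries could look like this:
>>
>>     # use RocksDB as the state backend
>>     state.backend: rocksdb
>>     # shared checkpoint directory; bucket and path are hypothetical
>>     state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints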
>>
>> I think you can access Windows Azure Blob Storage via Hadoop [1],
>> similarly to how you access S3, for example.
>>
>> If you use S3 to store your checkpoint data, then you will benefit from
>> all the advantages of S3 but also suffer from its drawbacks (e.g. that list
>> operations are more costly). But these are not specific to Flink.
>>
>> A URL like file:// usually indicates a local file. Thus, if your Flink
>> cluster is not running on a single machine, then this won’t work.
>>
>> [1] https://hadoop.apache.org/docs/stable/hadoop-azure/index.html
>>
>> Cheers,
>> Till
>>
>>
>> On Thu, May 11, 2017 at 10:41 AM, Ayush Goyal <ay...@helpshift.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I had a few questions regarding checkpoint storage options using
>>> RocksDBStateBackend. In the Flink 1.2 documentation, it is the
>>> recommended state backend due to its ability to store large states and
>>> its asynchronous snapshotting. For high availability, it seems HDFS is
>>> the recommended store for state backend data. In the AWS deployment
>>> section, it is also mentioned that S3 can be used for storing state
>>> backend data.
>>>
>>> We don't want to depend on a Hadoop cluster for our Flink deployment, so
>>> I had the following questions:
>>>
>>> 1. Can we use any storage backend supported by Flink for storing
>>> RocksDB StateBackend data with file URLs? There are quite a few
>>> supported, as mentioned here:
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/filesystems.html
>>> and here:
>>> https://github.com/apache/flink/blob/master/docs/dev/batch/connectors.md
>>>
>>> 2. Is there some work already done to support Windows Azure Blob Storage
>>> for storing state backend data? There are some docs here:
>>> https://github.com/apache/flink/blob/master/docs/dev/batch/connectors.md
>>> Can we utilize this for that?
>>>
>>> 3. If we use S3 for the state backend, is there any performance impact?
>>>
>>> 4. For high availability, can we use an NFS volume for the state backend
>>> with "file://" URLs? Will there be any performance impact?
>>>
>>> PS: I posted this email earlier via Nabble, but it's not showing up in
>>> the Apache archive, so I am sending it again. Apologies if it results in
>>> multiple threads.
>>>
>>> -- Ayush
>>>
>>
>>
>