Storage options for RocksDBStateBackend

2017-05-11 Thread ayush
Hello,

I have a few questions regarding checkpoint storage options when using
RocksDBStateBackend. The Flink 1.2 documentation recommends it as the state
backend because of its ability to store large state and its asynchronous
snapshotting. For high availability, HDFS seems to be the recommended store
for state backend data. The AWS deployment section also mentions that S3 can
be used for storing state backend data.

We don't want to depend on a Hadoop cluster for our Flink deployment, so I
have the following questions:

1. Can we use any storage backend supported by Flink for storing
RocksDBStateBackend data via file URLs? Quite a few are supported, as
mentioned here:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/filesystems.html
and here:
https://github.com/apache/flink/blob/master/docs/dev/batch/connectors.md

2. Has any work already been done to support Windows Azure Blob Storage for
storing state backend data? There are some docs here:
https://github.com/apache/flink/blob/master/docs/dev/batch/connectors.md
Can we use this for that purpose?

3. If we use S3 for the state backend, is there any performance impact?

4. For high availability, can we use an NFS volume for the state backend, with
"file://" URLs? Will there be any performance impact?

-- Ayush





Storage options for RocksDBStateBackend

2017-05-11 Thread Ayush Goyal
Hello,

I have a few questions regarding checkpoint storage options when using
RocksDBStateBackend. The Flink 1.2 documentation recommends it as the state
backend because of its ability to store large state and its asynchronous
snapshotting. For high availability, HDFS seems to be the recommended store
for state backend data. The AWS deployment section also mentions that S3 can
be used for storing state backend data.

We don't want to depend on a Hadoop cluster for our Flink deployment, so I
have the following questions:

1. Can we use any storage backend supported by Flink for storing
RocksDBStateBackend data via file URLs? Quite a few are supported, as
mentioned here:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/filesystems.html
and here:
https://github.com/apache/flink/blob/master/docs/dev/batch/connectors.md

2. Has any work already been done to support Windows Azure Blob Storage for
storing state backend data? There are some docs here:
https://github.com/apache/flink/blob/master/docs/dev/batch/connectors.md
Can we use this for that purpose?

3. If we use S3 for the state backend, is there any performance impact?

4. For high availability, can we use an NFS volume for the state backend, with
"file://" URLs? Will there be any performance impact?

PS: I posted this email earlier via Nabble, but it is not showing up in the
Apache archive, so I am sending it again. Apologies if it results in multiple
threads.

-- Ayush


Re: Storage options for RocksDBStateBackend

2017-05-11 Thread Till Rohrmann
Hi Ayush,

you’re right that RocksDB is the recommended state backend for the reasons
mentioned above. For recovery to work properly, you have to configure a
shared directory for the checkpoint data via
state.backend.fs.checkpointdir. You can basically configure any file system
that is supported by Hadoop (no HDFS required). The reason is that we use
Hadoop to bridge between different file systems. The only thing you have to
make sure of is that the respective file system implementation is on your
classpath.
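
As a rough illustration of the setting described above, the following Python
sketch renders the relevant flink-conf.yaml entries for a given checkpoint
URI. Only the key state.backend.fs.checkpointdir comes from this thread; the
value "rocksdb" for state.backend, the helper name and the example bucket are
assumptions made for illustration and should be checked against the
configuration docs of your Flink version.

```python
def render_checkpoint_config(checkpoint_uri: str) -> str:
    """Return flink-conf.yaml lines that point checkpoints at a shared directory.

    Hypothetical helper for illustration; it only formats text and does not
    talk to Flink.
    """
    # Require an explicit scheme so a bare local path is not used by accident.
    if "://" not in checkpoint_uri:
        raise ValueError("checkpoint URI needs a scheme, e.g. s3:// or hdfs://")
    lines = [
        # Assumed shortcut name for the RocksDB state backend.
        "state.backend: rocksdb",
        # Key named in this thread: shared directory for checkpoint data.
        f"state.backend.fs.checkpointdir: {checkpoint_uri}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # The bucket name is a placeholder.
    print(render_checkpoint_config("s3://my-bucket/flink/checkpoints"))
```

The printed lines would go into flink-conf.yaml on every node, so that all
TaskManagers and the JobManager resolve the same shared directory.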

I think you can access Windows Azure Blob Storage via Hadoop [1], similar to
how you access S3, for example.

If you use S3 to store your checkpoint data, then you will benefit from all
the advantages of S3 but also suffer from its drawbacks (e.g. list
operations are more costly). These are not specific to Flink, though.

A URL like file:// usually indicates a local file. Thus, if your Flink
cluster is not running on a single machine, this won’t work.

[1] https://hadoop.apache.org/docs/stable/hadoop-azure/index.html

Cheers,
Till
​



Re: Storage options for RocksDBStateBackend

2017-05-11 Thread Stephan Ewen
Small addition to Till's comment:

If file:// points to a mounted distributed file system (NFS, MapRFs, ...),
then it actually works. The important thing is that the file system where
the checkpoints go is replicated (fault tolerant) and accessible from all
nodes.
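
To make the distinction concrete, here is a minimal Python sketch that
classifies a checkpoint URI by its scheme. The set of schemes and the
function name are assumptions chosen for illustration; this is not how Flink
itself validates the setting, and a file:// URI can only be judged by
checking that the path is a shared mount on every node.

```python
from urllib.parse import urlparse

# Schemes assumed here to denote storage reachable from all nodes.
# The list is illustrative, not taken from Flink.
REMOTE_SCHEMES = {"hdfs", "s3", "s3a", "s3n", "wasb", "wasbs", "maprfs"}


def checkpoint_uri_advice(uri: str) -> str:
    """Classify a checkpoint URI by scheme (hypothetical helper)."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        return "shared"
    if scheme == "file":
        # Works only if the path is a replicated mount (e.g. NFS) on all nodes.
        return "shared-only-if-mounted-on-all-nodes"
    return "unknown"


if __name__ == "__main__":
    for uri in ("s3://my-bucket/checkpoints", "file:///mnt/nfs/flink/checkpoints"):
        print(uri, "->", checkpoint_uri_advice(uri))
```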



Re: Storage options for RocksDBStateBackend

2017-05-11 Thread Ayush Goyal
Till and Stephan, thanks for your clarification.

@Till One more question: from what I have read about checkpointing [1],
list operations don't seem likely to be performed frequently, so storing
state backend data on S3 shouldn't have any severe impact on Flink
performance. Is this assumption right?

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.2/internals/stream_checkpointing.html

-- Ayush



Re: Storage options for RocksDBStateBackend

2017-05-15 Thread Jain, Ankit
Also, I hope state and checkpoint writes to S3 happen asynchronously, without
impacting the actual job execution graph?

If so, will there still be a performance impact from using S3?

Thanks
Ankit
