Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-26 Thread Yang Wang
Usually, you should use the HDFS nameservice instead of the NameNode
hostname:port, so that the checkpoint path stays valid across an NN failover.
You can find the configured nameservice in hdfs-site.xml under the
key *dfs.nameservices*.
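
A minimal sketch of how this could look in flink-conf.yaml, assuming
hdfs-site.xml lists a nameservice called "my-nameservice" (a placeholder, not
a name EMR is known to assign) and that /flink-checkpoints is the directory
you want to use:

```yaml
# flink-conf.yaml (sketch)
# "my-nameservice" is a placeholder; substitute the value of dfs.nameservices
# from your cluster's hdfs-site.xml.
state.backend.type: rocksdb
state.checkpoints.dir: hdfs://my-nameservice/flink-checkpoints
state.backend.incremental: true
```

No port is given here because the HDFS client resolves the nameservice to
whichever NameNode is currently active.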


Best,
Yang

On Fri, Mar 22, 2024 at 8:33 PM Sachin Mittal wrote:

> So, when we create an EMR cluster, the NN service runs on the primary node
> of the cluster.
> Now, at the time of creating the cluster, how can we specify the name of
> this NN in the format hdfs://*namenode-host*:8020/?
>
> Is there a standard name by which we can identify the NN server?
>
> Thanks
> Sachin
>
>
> On Fri, Mar 22, 2024 at 12:08 PM Asimansu Bera 
> wrote:
>
>> Hello Sachin,
>>
>> Typically, Cloud VMs are ephemeral, meaning that if the EMR cluster goes
>> down or VMs are required to be shut down for security updates or due to
>> faults, new VMs will be added to the cluster. As a result, any data stored
>> in the local file system, such as file:///tmp, would be lost. To ensure data
>> persistence and prevent loss of checkpoint or savepoint data for recovery,
>> it is advisable to store such data in a persistent storage solution like
>> HDFS or S3.
>>
>> Generally, EMR based Hadoop NN runs on 8020 port. You may find the NN IP
>> details from EMR service.
>>
>> Hope this helps.
>>
>> -A
>>
>>
>> On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal 
>> wrote:
>>
>>> Hi,
>>> We are using AWS EMR where we can submit our flink jobs to a long
>>> running flink cluster on Yarn.
>>>
>>> We wanted to configure RocksDBStateBackend as our state backend to store
>>> our checkpoints.
>>>
>>> So we have configured following properties in our flink-conf.yaml
>>>
>>>    - state.backend.type: rocksdb
>>>    - state.checkpoints.dir: file:///tmp
>>>    - state.backend.incremental: true
>>>
>>>
>>> My question here is regarding the checkpoint location: what is the
>>> difference between the location if it is a local filesystem vs a hadoop
>>> distributed file system (hdfs).
>>>
>>> What advantages we get if we use:
>>>
>>> *state.checkpoints.dir*: hdfs://namenode-host:port/flink-checkpoints
>>> vs
>>> *state.checkpoints.dir*: file:///tmp
>>>
>>> Also if we decide to use HDFS then from where we can get the value for
>>> *namenode-host:port*
>>> given we are running Flink on an EMR.
>>>
>>> Thanks
>>> Sachin
>>>
>>>
>>>


Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-22 Thread Sachin Mittal
So, when we create an EMR cluster, the NN service runs on the primary node
of the cluster.
Now, at the time of creating the cluster, how can we specify the name of
this NN in the format hdfs://*namenode-host*:8020/?

Is there a standard name by which we can identify the NN server?

Thanks
Sachin


On Fri, Mar 22, 2024 at 12:08 PM Asimansu Bera 
wrote:

> Hello Sachin,
>
> Typically, Cloud VMs are ephemeral, meaning that if the EMR cluster goes
> down or VMs are required to be shut down for security updates or due to
> faults, new VMs will be added to the cluster. As a result, any data stored
> in the local file system, such as file:///tmp, would be lost. To ensure data
> persistence and prevent loss of checkpoint or savepoint data for recovery,
> it is advisable to store such data in a persistent storage solution like
> HDFS or S3.
>
> Generally, EMR based Hadoop NN runs on 8020 port. You may find the NN IP
> details from EMR service.
>
> Hope this helps.
>
> -A
>
>
> On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal wrote:
>
>> Hi,
>> We are using AWS EMR where we can submit our flink jobs to a long running
>> flink cluster on Yarn.
>>
>> We wanted to configure RocksDBStateBackend as our state backend to store
>> our checkpoints.
>>
>> So we have configured following properties in our flink-conf.yaml
>>
>>    - state.backend.type: rocksdb
>>    - state.checkpoints.dir: file:///tmp
>>    - state.backend.incremental: true
>>
>>
>> My question here is regarding the checkpoint location: what is the
>> difference between the location if it is a local filesystem vs a hadoop
>> distributed file system (hdfs).
>>
>> What advantages we get if we use:
>>
>> *state.checkpoints.dir*: hdfs://namenode-host:port/flink-checkpoints
>> vs
>> *state.checkpoints.dir*: file:///tmp
>>
>> Also if we decide to use HDFS then from where we can get the value for
>> *namenode-host:port*
>> given we are running Flink on an EMR.
>>
>> Thanks
>> Sachin
>>
>>
>>


Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-21 Thread Asimansu Bera
Hello Sachin,

Typically, Cloud VMs are ephemeral, meaning that if the EMR cluster goes
down or VMs are required to be shut down for security updates or due to
faults, new VMs will be added to the cluster. As a result, any data stored
in the local file system, such as file:///tmp, would be lost. To ensure data
persistence and prevent loss of checkpoint or savepoint data for recovery,
it is advisable to store such data in a persistent storage solution like
HDFS or S3.

Generally, the Hadoop NN on EMR listens on port 8020. You can find the NN IP
details in the EMR service.
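
As a sketch, the two persistent options could be written as follows in
flink-conf.yaml. The host name "namenode-host" and the bucket "my-bucket" are
placeholders, and only one of the two lines would be active at a time:

```yaml
# flink-conf.yaml (sketch; choose one checkpoint location)
# HDFS on the cluster, addressing the NameNode directly on port 8020.
# "namenode-host" is a placeholder for the primary node's host name or IP.
state.checkpoints.dir: hdfs://namenode-host:8020/flink-checkpoints

# S3, which outlives the cluster. "my-bucket" is a placeholder bucket name.
# state.checkpoints.dir: s3://my-bucket/flink-checkpoints
```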

Hope this helps.

-A


On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal wrote:

> Hi,
> We are using AWS EMR where we can submit our flink jobs to a long running
> flink cluster on Yarn.
>
> We wanted to configure RocksDBStateBackend as our state backend to store
> our checkpoints.
>
> So we have configured following properties in our flink-conf.yaml
>
>    - state.backend.type: rocksdb
>    - state.checkpoints.dir: file:///tmp
>    - state.backend.incremental: true
>
>
> My question here is regarding the checkpoint location: what is the
> difference between the location if it is a local filesystem vs a hadoop
> distributed file system (hdfs).
>
> What advantages we get if we use:
>
> *state.checkpoints.dir*: hdfs://namenode-host:port/flink-checkpoints
> vs
> *state.checkpoints.dir*: file:///tmp
>
> Also if we decide to use HDFS then from where we can get the value for
> *namenode-host:port*
> given we are running Flink on an EMR.
>
> Thanks
> Sachin
>
>
>


Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-21 Thread Sachin Mittal
Hi,
We are using AWS EMR where we can submit our flink jobs to a long running
flink cluster on Yarn.

We wanted to configure RocksDBStateBackend as our state backend to store
our checkpoints.

So we have configured the following properties in our flink-conf.yaml:

   - state.backend.type: rocksdb
   - state.checkpoints.dir: file:///tmp
   - state.backend.incremental: true


My question is about the checkpoint location: what is the
difference between using a local filesystem and using the Hadoop
Distributed File System (HDFS)?

What advantages do we get if we use:

*state.checkpoints.dir*: hdfs://namenode-host:port/flink-checkpoints
vs
*state.checkpoints.dir*: file:///tmp

Also, if we decide to use HDFS, where can we get the value for
*namenode-host:port*,
given that we are running Flink on EMR?

Thanks
Sachin