Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR
Usually, you should use the HDFS nameservice instead of the NameNode hostname:port, so that the path stays valid across a NameNode failover. You can find the configured nameservice in hdfs-site.xml under the key *dfs.nameservices*.

Best,
Yang

On Fri, Mar 22, 2024 at 8:33 PM Sachin Mittal wrote:

> So, when we create an EMR cluster the NN service runs on the primary node
> of the cluster.
> Now at the time of creating the cluster, how can we specify the name of
> this NN in format hdfs://*namenode-host*:8020/.
>
> Is there a standard name by which we can identify the NN server ?
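A minimal sketch of the lookup suggested above, assuming Python is available on a cluster node. It reads *dfs.nameservices* from the contents of hdfs-site.xml and, if no nameservice is configured (which may be the case on a cluster without NameNode HA), falls back to fs.defaultFS from core-site.xml. The helper names, the /flink-checkpoints path and the sample values in the usage note are illustrative only, not part of Flink or Hadoop.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, key):
    """Return the value of `key` from Hadoop-style configuration XML, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == key:
            value = (prop.findtext("value") or "").strip()
            return value or None
    return None


def checkpoint_dir(hdfs_site_xml, core_site_xml, path="/flink-checkpoints"):
    """Build a candidate value for state.checkpoints.dir (hypothetical helper).

    Prefers the HDFS nameservice; falls back to fs.defaultFS when
    dfs.nameservices is not set. Returns None if neither key is present.
    """
    nameservices = read_hadoop_property(hdfs_site_xml, "dfs.nameservices")
    if nameservices:
        # The key may hold a comma-separated list; this sketch takes the first.
        return "hdfs://" + nameservices.split(",")[0].strip() + path

    default_fs = read_hadoop_property(core_site_xml, "fs.defaultFS")
    if default_fs:
        return default_fs.rstrip("/") + path
    return None
```

To use it, pass in the text of the two files, for example `checkpoint_dir(open("/etc/hadoop/conf/hdfs-site.xml").read(), open("/etc/hadoop/conf/core-site.xml").read())`. The /etc/hadoop/conf location is an assumption; check where the Hadoop configuration lives on your cluster.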
Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR
So, when we create an EMR cluster, the NN service runs on the primary node of the cluster. At the time of creating the cluster, how can we specify the name of this NN in the format hdfs://*namenode-host*:8020/?

Is there a standard name by which we can identify the NN server?

Thanks
Sachin

On Fri, Mar 22, 2024 at 12:08 PM Asimansu Bera wrote:

> Generally, EMR based Hadoop NN runs on 8020 port. You may find the NN IP
> details from EMR service.
Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR
Hello Sachin,

Typically, cloud VMs are ephemeral: if the EMR cluster goes down, or VMs have to be shut down for security updates or due to faults, new VMs will be added to the cluster. As a result, any data stored in the local file system, such as file:///tmp, would be lost. To ensure data persistence and prevent loss of the checkpoint or savepoint data needed for recovery, it is advisable to store such data in a persistent storage solution like HDFS or S3.

Generally, the Hadoop NN on EMR listens on port 8020. You can find the NN IP details from the EMR service.

Hope this helps.

-A

On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal wrote:

> What advantages we get if we use:
>
> *state.checkpoints.dir*: hdfs://namenode-host:port/flink-checkpoints
> vs
> *state.checkpoints.dir*: file:///tmp
>
> Also if we decide to use HDFS then from where we can get the value for
> *namenode-host:port*
> given we are running Flink on an EMR.
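The distinction above can be checked mechanically before a job is deployed. The following is a rough sketch, not anything Flink provides: `describe_checkpoint_dir` is a hypothetical helper that looks only at the URI scheme of a candidate state.checkpoints.dir value. It cannot tell whether a remote location is actually durable or reachable; it only flags node-local paths and the easy-to-make file://tmp typo, where "tmp" is parsed as a host name rather than a path.

```python
from urllib.parse import urlparse


def describe_checkpoint_dir(uri):
    """Classify a checkpoint URI by scheme only (hypothetical helper).

    Returns "local" for node-local paths, "malformed" for file:// URIs that
    carry a host part (such as file://tmp), and "remote" for anything else.
    """
    parsed = urlparse(uri)
    if parsed.scheme in ("", "file"):
        if parsed.scheme == "file" and parsed.netloc:
            # file://tmp puts "tmp" in the host position and leaves the path empty.
            return "malformed"
        return "local"
    return "remote"


if __name__ == "__main__":
    for candidate in (
        "file:///tmp",
        "file://tmp",
        "hdfs://namenode-host:8020/flink-checkpoints",
    ):
        print(candidate, "->", describe_checkpoint_dir(candidate))
```

A "local" result means the checkpoint data lives on a single VM and is lost when that VM is replaced, which is the failure mode described above.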
Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR
Hi,

We are using AWS EMR, where we can submit our Flink jobs to a long-running Flink cluster on YARN.

We want to configure RocksDBStateBackend as our state backend to store our checkpoints.

So we have configured the following properties in our flink-conf.yaml:

- state.backend.type: rocksdb
- state.checkpoints.dir: file:///tmp
- state.backend.incremental: true

My question is regarding the checkpoint location: what is the difference between a location on the local filesystem and one on the Hadoop Distributed File System (HDFS)?

What advantages do we get if we use:

*state.checkpoints.dir*: hdfs://namenode-host:port/flink-checkpoints
vs
*state.checkpoints.dir*: file:///tmp

Also, if we decide to use HDFS, where can we get the value for *namenode-host:port*, given we are running Flink on EMR?

Thanks
Sachin