Re: HDFS HA namenode issue

2023-10-03 Thread Harry Jamison
 Thanks guys, I figured out what my issue was. I did not set up the SSH key 
correctly; it was for my user, but I started the service as root.
Now it is working except none of the namenodes are transitioning to active on 
startup, and the datanodes are not starting automatically (I think because no 
namenode is active).
I can start everything manually though.
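
A note for anyone hitting the same thing: namenodes only leave standby on their 
own when automatic failover (ZKFC) is configured; otherwise one of them has to 
be promoted by hand. A rough sketch of both paths, assuming the nameservice and 
namenode IDs (mycluster, nn1) from the config later in this thread:

```
# Manual HA: promote one namenode after startup and check its state
hdfs haadmin -transitionToActive nn1
hdfs haadmin -getServiceState nn1

# Automatic failover instead: set dfs.ha.automatic-failover.enabled to true in
# hdfs-site.xml, format the failover znode in ZooKeeper once, then run a ZKFC
# daemon on each namenode host
hdfs zkfc -formatZK
hdfs --daemon start zkfc
```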

On Tuesday, October 3, 2023 at 11:03:33 AM PDT, Susheel Kumar Gadalay 
 wrote:  
 
 Why have you set this again in hdfs-site.xml at the end?

  <property>
    <name>dfs.namenode.rpc-address</name>
    <value>nn1:8020</value>
  </property>

Remove this and start the namenode again.

Regards
Susheel Kumar

On Tue, 3 Oct 2023, 10:09 pm Harry Jamison,  wrote:

 OK here is where I am at now.
When I start the namenodes, they work, but they are all in standby mode. When I 
start my first datanode it seems to kill one of the namenodes (the active one, I 
assume).
I am getting 2 different warnings in the namenode
[2023-10-03 09:03:52,162] WARN Unable to initialize FileSignerSecretProvider, 
falling back to use random secrets. Reason: Could not read signature secret 
file: /root/hadoop-http-auth-signature-secret 
(org.apache.hadoop.security.authentication.server.AuthenticationFilter)

[2023-10-03 09:03:52,350] WARN Only one image storage directory 
(dfs.namenode.name.dir) configured. Beware of data loss due to lack of 
redundant storage directories! 
(org.apache.hadoop.hdfs.server.namenode.FSNamesystem)

I am using a journal node, so I am not clear if I am supposed to have multiple 
dfs.namenode.name.dir directories. I thought each namenode has one directory.

Susheel Kumar Gadalay said that my shared.edits.dir is wrong, but I am not 
clear how it is wrong. From here mine looks right:
https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

This is what is in the logs right before the namenode dies:

[2023-10-03 09:01:22,054] INFO Listener at vmnode3:8020 (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,054] INFO Starting Socket Reader #1 for port 8020 (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,097] INFO Registered FSNamesystemState, ReplicatedBlocksState and ECBlockGroupsState MBeans. (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
[2023-10-03 09:01:22,119] INFO Number of blocks under construction: 0 (org.apache.hadoop.hdfs.server.namenode.LeaseManager)
[2023-10-03 09:01:22,122] INFO Initialized the Default Decommission and Maintenance monitor (org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminDefaultMonitor)
[2023-10-03 09:01:22,131] INFO STATE* Leaving safe mode after 0 secs (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,131] INFO STATE* Network topology has 0 racks and 0 datanodes (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,131] INFO STATE* UnderReplicatedBlocks has 0 blocks (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,130] INFO Start MarkedDeleteBlockScrubber thread (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
[2023-10-03 09:01:22,158] INFO IPC Server Responder: starting (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,159] INFO IPC Server listener on 8020: starting (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,165] INFO NameNode RPC up at: vmnode3/192.168.1.103:8020 (org.apache.hadoop.hdfs.server.namenode.NameNode)
[2023-10-03 09:01:22,166] INFO Starting services required for standby state (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
[2023-10-03 09:01:22,168] INFO Will roll logs on active node every 120 seconds. (org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer)
[2023-10-03 09:01:22,171] INFO Starting standby checkpoint thread...
Checkpointing active NN to possible NNs: [http://vmnode1:9870, http://vmnode2:9870]
Serving checkpoints at http://vmnode3:9870 (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 15187
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 15187
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited






On Tuesday, October 3, 2023 at 03:54:23 AM PDT, Liming Cui 
 wrote:  
 
 Harry,
Great question. I would say the same configurations in core-site.xml and 
hdfs-site.xml will be overwriting each other in some way.
Glad you found the root cause.
Keep going.
On Tue, Oct 3, 2023 at 10:27 AM Harry Jamison  wrote:

 Liming 
After looking at my 

Re: HDFS HA namenode issue

2023-10-03 Thread Susheel Kumar Gadalay
Why have you set this again in hdfs-site.xml at the end?

  <property>
    <name>dfs.namenode.rpc-address</name>
    <value>nn1:8020</value>
  </property>

Remove this and start the namenode again.

Regards
Susheel Kumar
On Tue, 3 Oct 2023, 10:09 pm Harry Jamison,
 wrote:

> OK here is where I am at now.
>
> When I start the namenodes, they work, but they are all in standby mode.
> When I start my first datanode it seems to kill one of the namenodes (the
> active one I assume)
>
> I am getting 2 different warnings in the namenode
>
> [2023-10-03 09:03:52,162] WARN Unable to initialize
> FileSignerSecretProvider, falling back to use random secrets. Reason: Could
> not read signature secret file: /root/hadoop-http-auth-signature-secret
> (org.apache.hadoop.security.authentication.server.AuthenticationFilter)
>
> [2023-10-03 09:03:52,350] WARN Only one image storage directory
> (dfs.namenode.name.dir) configured. Beware of data loss due to lack of
> redundant storage directories!
> (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
>
> I am using a journal node, so I am not clear if I am supposed to have
> multiple dfs.namenode.name.dir directories
> I thought each namenode has 1 directory.
>
>
> Susheel Kumar Gadalay said that my shared.edits.dir Is wrong, but I am
> not clear how it is wrong
> From here mine looks right
>
> https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
>
> This is what is in the logs right before the namenode dies
> [2023-10-03 09:01:22,054] INFO Listener at vmnode3:8020
> (org.apache.hadoop.ipc.Server)
> [2023-10-03 09:01:22,054] INFO Starting Socket Reader #1 for port 8020
> (org.apache.hadoop.ipc.Server)
> [2023-10-03 09:01:22,097] INFO Registered FSNamesystemState,
> ReplicatedBlocksState and ECBlockGroupsState MBeans.
> (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
> [2023-10-03 09:01:22,119] INFO Number of blocks under construction: 0
> (org.apache.hadoop.hdfs.server.namenode.LeaseManager)
> [2023-10-03 09:01:22,122] INFO Initialized the Default Decommission and
> Maintenance monitor
> (org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminDefaultMonitor)
> [2023-10-03 09:01:22,131] INFO STATE* Leaving safe mode after 0 secs
> (org.apache.hadoop.hdfs.StateChange)
> [2023-10-03 09:01:22,131] INFO STATE* Network topology has 0 racks and 0
> datanodes (org.apache.hadoop.hdfs.StateChange)
> [2023-10-03 09:01:22,131] INFO STATE* UnderReplicatedBlocks has 0 blocks
> (org.apache.hadoop.hdfs.StateChange)
> [2023-10-03 09:01:22,130] INFO Start MarkedDeleteBlockScrubber thread
> (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
> [2023-10-03 09:01:22,158] INFO IPC Server Responder: starting
> (org.apache.hadoop.ipc.Server)
> [2023-10-03 09:01:22,159] INFO IPC Server listener on 8020: starting
> (org.apache.hadoop.ipc.Server)
> [2023-10-03 09:01:22,165] INFO NameNode RPC up at: vmnode3/
> 192.168.1.103:8020 (org.apache.hadoop.hdfs.server.namenode.NameNode)
> [2023-10-03 09:01:22,166] INFO Starting services required for standby
> state (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
> [2023-10-03 09:01:22,168] INFO Will roll logs on active node every 120
> seconds. (org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer)
> [2023-10-03 09:01:22,171] INFO Starting standby checkpoint thread...
> Checkpointing active NN to possible NNs: [http://vmnode1:9870,
> http://vmnode2:9870]
> Serving checkpoints at http://vmnode3:9870
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
> real-time non-blocking time  (microseconds, -R) unlimited
> core file size  (blocks, -c) 0
> data seg size   (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size   (blocks, -f) unlimited
> pending signals (-i) 15187
> max locked memory   (kbytes, -l) 8192
> max memory size (kbytes, -m) unlimited
> open files  (-n) 1024
> pipe size(512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority  (-r) 0
> stack size  (kbytes, -s) 8192
> cpu time   (seconds, -t) unlimited
> max user processes  (-u) 15187
> virtual memory  (kbytes, -v) unlimited
> file locks  (-x) unlimited
>
>
>
>
>
>
>
> On Tuesday, October 3, 2023 at 03:54:23 AM PDT, Liming Cui <
> anyone.cui...@gmail.com> wrote:
>
>
> Harry,
>
> Great question.
> I would say the same configurations in core-site.xml and hdfs-site.xml
> will be overwriting each other in some way.
>
> Glad you found the root cause.
>
> Keep going.
>
> On Tue, Oct 3, 2023 at 10:27 AM Harry Jamison 
> wrote:
>
> Liming
>
> After looking at my config, I think that maybe my problem is because my 
> fs.defaultFS
> is inconsistent between hdfs-site.xml and core-site.xml
> What does hdfs-site.xml vs core-site.xml do why is the same setting in 2
> different places?
> Or do I just have it 

Re: HDFS HA namenode issue

2023-10-03 Thread Harry Jamison
 OK here is where I am at now.
When I start the namenodes, they work, but they are all in standby mode. When I 
start my first datanode it seems to kill one of the namenodes (the active one, I 
assume).
I am getting 2 different warnings in the namenode
[2023-10-03 09:03:52,162] WARN Unable to initialize FileSignerSecretProvider, 
falling back to use random secrets. Reason: Could not read signature secret 
file: /root/hadoop-http-auth-signature-secret 
(org.apache.hadoop.security.authentication.server.AuthenticationFilter)
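
Side note on that first warning: it is usually harmless (the filter just falls 
back to a random secret), and it goes away if the secret file it names exists 
and is readable by the user running the namenode. A minimal sketch, using the 
path from the warning (the property behind it is 
hadoop.http.authentication.signature.secret.file in core-site.xml):

```
# create the HTTP auth signature secret file the AuthenticationFilter expects
head -c 64 /dev/urandom | base64 > /root/hadoop-http-auth-signature-secret
chmod 600 /root/hadoop-http-auth-signature-secret
```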

[2023-10-03 09:03:52,350] WARN Only one image storage directory 
(dfs.namenode.name.dir) configured. Beware of data loss due to lack of 
redundant storage directories! 
(org.apache.hadoop.hdfs.server.namenode.FSNamesystem)

I am using a journal node, so I am not clear if I am supposed to have multiple 
dfs.namenode.name.dir directories. I thought each namenode has one directory.
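
For what it is worth, dfs.namenode.name.dir accepts a comma-separated list of 
local directories and the namenode writes a full copy of its fsimage to each; 
the JournalNodes only provide the shared edit log, so the warning is only about 
local fsimage redundancy. A hedged example (the second path is made up for 
illustration):

```
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/hadoop/data/hdfs/namenode,file:/hadoop/data2/hdfs/namenode</value>
</property>
```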

Susheel Kumar Gadalay said that my shared.edits.dir is wrong, but I am not 
clear how it is wrong. From here mine looks right:
https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

This is what is in the logs right before the namenode dies:

[2023-10-03 09:01:22,054] INFO Listener at vmnode3:8020 (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,054] INFO Starting Socket Reader #1 for port 8020 (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,097] INFO Registered FSNamesystemState, ReplicatedBlocksState and ECBlockGroupsState MBeans. (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
[2023-10-03 09:01:22,119] INFO Number of blocks under construction: 0 (org.apache.hadoop.hdfs.server.namenode.LeaseManager)
[2023-10-03 09:01:22,122] INFO Initialized the Default Decommission and Maintenance monitor (org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminDefaultMonitor)
[2023-10-03 09:01:22,131] INFO STATE* Leaving safe mode after 0 secs (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,131] INFO STATE* Network topology has 0 racks and 0 datanodes (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,131] INFO STATE* UnderReplicatedBlocks has 0 blocks (org.apache.hadoop.hdfs.StateChange)
[2023-10-03 09:01:22,130] INFO Start MarkedDeleteBlockScrubber thread (org.apache.hadoop.hdfs.server.blockmanagement.BlockManager)
[2023-10-03 09:01:22,158] INFO IPC Server Responder: starting (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,159] INFO IPC Server listener on 8020: starting (org.apache.hadoop.ipc.Server)
[2023-10-03 09:01:22,165] INFO NameNode RPC up at: vmnode3/192.168.1.103:8020 (org.apache.hadoop.hdfs.server.namenode.NameNode)
[2023-10-03 09:01:22,166] INFO Starting services required for standby state (org.apache.hadoop.hdfs.server.namenode.FSNamesystem)
[2023-10-03 09:01:22,168] INFO Will roll logs on active node every 120 seconds. (org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer)
[2023-10-03 09:01:22,171] INFO Starting standby checkpoint thread...
Checkpointing active NN to possible NNs: [http://vmnode1:9870, http://vmnode2:9870]
Serving checkpoints at http://vmnode3:9870 (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 15187
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 15187
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited






On Tuesday, October 3, 2023 at 03:54:23 AM PDT, Liming Cui 
 wrote:  
 
 Harry,
Great question. I would say the same configurations in core-site.xml and 
hdfs-site.xml will be overwriting each other in some way.
Glad you found the root cause.
Keep going.
On Tue, Oct 3, 2023 at 10:27 AM Harry Jamison  wrote:

 Liming,
After looking at my config, I think that maybe my problem is because my 
fs.defaultFS is inconsistent between hdfs-site.xml and core-site.xml.
What do hdfs-site.xml vs core-site.xml do, and why is the same setting in 2 
different places? Or do I just have it there mistakenly?
This is what I have in hdfs-site.xml:

<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://mycluster</value></property>
  <property><name>ha.zookeeper.quorum</name><value>nn1:2181,nn2:2181,nn3:2181</value></property>
  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2,nn3</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn3</name><value>nn3:8020</value></property>

Re: HDFS HA namenode issue

2023-10-03 Thread Liming Cui
Harry,

Great question.
I would say the same configurations in core-site.xml and hdfs-site.xml will
be overwriting each other in some way.

Glad you found the root cause.

Keep going.

On Tue, Oct 3, 2023 at 10:27 AM Harry Jamison 
wrote:

> Liming
>
> After looking at my config, I think that maybe my problem is because my 
> fs.defaultFS
> is inconsistent between hdfs-site.xml and core-site.xml
> What does hdfs-site.xml vs core-site.xml do why is the same setting in 2
> different places?
> Or do I just have it there mistakenly?
>
> this is what I have in hdfs-site.xml
>
> 
> 
> 
>   
>   fs.defaultFS
>   hdfs://mycluster
>
>   
> ha.zookeeper.quorum
> nn1:2181,nn2:2181,nn3:2181
>   
>
>   
> dfs.nameservices
> mycluster
>   
>
>   
> dfs.ha.namenodes.mycluster
> nn1,nn2,nn3
>   
>
>   
> dfs.namenode.rpc-address.mycluster.nn1
> nn1:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn2
> nn2:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn3
> nn3:8020
>   
>
>   
> dfs.namenode.http-address.mycluster.nn1
> nn1:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn2
> nn2:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn3
> nn3:9870
>   
>
>   
> dfs.namenode.shared.edits.dir
> qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster
>   
>   
> dfs.client.failover.proxy.provider.mycluster
>
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
>   
>
>   
> dfs.ha.fencing.methods
> sshfence
>   
>
>   
> dfs.ha.fencing.ssh.private-key-files
> /home/harry/.ssh/id_rsa
>   
>
>   
> dfs.namenode.name.dir
> file:/hadoop/data/hdfs/namenode
>   
>   
> dfs.datanode.data.dir
> file:/hadoop/data/hdfs/datanode
>   
>   
> dfs.journalnode.edits.dir
> /hadoop/data/hdfs/journalnode
>   
>   
> dfs.namenode.rpc-address
> nn1:8020
>   
>
>   
> dfs.ha.nn.not-become-active-in-safemode
> true
>   
>
> 
>
>
>
> In core-site.xml I have this
>
> 
>
> 
>
> 
>
>
> 
>
>
> 
>
>   
>
> fs.defaultFS
>
> hdfs://nn1:8020
>
>   
>
>
> 
>
>
> On Tuesday, October 3, 2023 at 12:54:26 AM PDT, Liming Cui <
> anyone.cui...@gmail.com> wrote:
>
>
> Can you show us the configuration files?
> Maybe I can help you with some suggestions.
>
>
> On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison
>  wrote:
>
> I am trying to setup a HA HDFS cluster, and I am running into a problem
>
> I am not sure what I am doing wrong, I thought I followed the HA namenode
> guide, but it is not working.
>
>
> Apache Hadoop 3.3.6 – HDFS High Availability
> 
>
>
>
> I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
>
> After some period of time I see the following and my namenode and journal
> node die.
> I am not sure where the problem is, or how to diagnose what I am doing
> wrong here.  And the logging here does not make sense to me.
>
> Namenode
> Serving checkpoints at http://nn1:9870
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
> [2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159
>
> /
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> JournalNode
> [2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no
> edit logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size   

Re: HDFS HA namenode issue

2023-10-03 Thread Susheel Kumar Gadalay
The core-site.xml configuration settings will be overridden by
hdfs-site.xml, mapred-site.xml, and yarn-site.xml. It used to work that way; I
don't know if it has changed now.

Look at your shared.edits.dir configuration. You have not set it correctly
across the namenodes.
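
For reference, the QJM form of that property should be identical in
hdfs-site.xml on every namenode, pointing at all the JournalNodes with the
nameservice as the journal ID, roughly (hostnames taken from the config below):

```
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster</value>
</property>
```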

Regards


On Tue, 3 Oct 2023, 1:59 pm Harry Jamison, 
wrote:

> Liming
>
> After looking at my config, I think that maybe my problem is because my 
> fs.defaultFS
> is inconsistent between hdfs-site.xml and core-site.xml
> What does hdfs-site.xml vs core-site.xml do why is the same setting in 2
> different places?
> Or do I just have it there mistakenly?
>
> this is what I have in hdfs-site.xml
>
> 
> 
> 
>   
>   fs.defaultFS
>   hdfs://mycluster
>
>   
> ha.zookeeper.quorum
> nn1:2181,nn2:2181,nn3:2181
>   
>
>   
> dfs.nameservices
> mycluster
>   
>
>   
> dfs.ha.namenodes.mycluster
> nn1,nn2,nn3
>   
>
>   
> dfs.namenode.rpc-address.mycluster.nn1
> nn1:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn2
> nn2:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn3
> nn3:8020
>   
>
>   
> dfs.namenode.http-address.mycluster.nn1
> nn1:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn2
> nn2:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn3
> nn3:9870
>   
>
>   
> dfs.namenode.shared.edits.dir
> qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster
>   
>   
> dfs.client.failover.proxy.provider.mycluster
>
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
>   
>
>   
> dfs.ha.fencing.methods
> sshfence
>   
>
>   
> dfs.ha.fencing.ssh.private-key-files
> /home/harry/.ssh/id_rsa
>   
>
>   
> dfs.namenode.name.dir
> file:/hadoop/data/hdfs/namenode
>   
>   
> dfs.datanode.data.dir
> file:/hadoop/data/hdfs/datanode
>   
>   
> dfs.journalnode.edits.dir
> /hadoop/data/hdfs/journalnode
>   
>   
> dfs.namenode.rpc-address
> nn1:8020
>   
>
>   
> dfs.ha.nn.not-become-active-in-safemode
> true
>   
>
> 
>
>
>
> In core-site.xml I have this
>
> 
>
> 
>
> 
>
>
> 
>
>
> 
>
>   
>
> fs.defaultFS
>
> hdfs://nn1:8020
>
>   
>
>
> 
>
>
> On Tuesday, October 3, 2023 at 12:54:26 AM PDT, Liming Cui <
> anyone.cui...@gmail.com> wrote:
>
>
> Can you show us the configuration files?
> Maybe I can help you with some suggestions.
>
>
> On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison
>  wrote:
>
> I am trying to setup a HA HDFS cluster, and I am running into a problem
>
> I am not sure what I am doing wrong, I thought I followed the HA namenode
> guide, but it is not working.
>
>
> Apache Hadoop 3.3.6 – HDFS High Availability
> 
>
>
>
> I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
>
> After some period of time I see the following and my namenode and journal
> node die.
> I am not sure where the problem is, or how to diagnose what I am doing
> wrong here.  And the logging here does not make sense to me.
>
> Namenode
> Serving checkpoints at http://nn1:9870
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
> [2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159
>
> /
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> JournalNode
> [2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no
> edit logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 

Re: HDFS HA namenode issue

2023-10-03 Thread Ayush Saxena
> Or do I just have it there mistakenly?

Yes, it should be in core-site.xml.

It is there in the HA doc
```

fs.defaultFS - the default path prefix used by the Hadoop FS client
when none is given

Optionally, you may now configure the default path for Hadoop clients
to use the new HA-enabled logical URI. If you used “mycluster” as the
nameservice ID earlier, this will be the value of the authority
portion of all of your HDFS paths. This may be configured like so, in
your core-site.xml file:

```
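
Concretely, with the nameservice ID mycluster from the hdfs-site.xml below,
core-site.xml would point fs.defaultFS at the logical URI rather than a single
namenode, something like:

```
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
```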

-Ayush

On Tue, 3 Oct 2023 at 13:58, Harry Jamison
 wrote:
>
> Liming
>
> After looking at my config, I think that maybe my problem is because my 
> fs.defaultFS is inconsistent between hdfs-site.xml and core-site.xml
> What does hdfs-site.xml vs core-site.xml do why is the same setting in 2 
> different places?
> Or do I just have it there mistakenly?
>
> this is what I have in hdfs-site.xml
>
> 
> 
> 
>   
>   fs.defaultFS
>   hdfs://mycluster
>
>   
> ha.zookeeper.quorum
> nn1:2181,nn2:2181,nn3:2181
>   
>
>   
> dfs.nameservices
> mycluster
>   
>
>   
> dfs.ha.namenodes.mycluster
> nn1,nn2,nn3
>   
>
>   
> dfs.namenode.rpc-address.mycluster.nn1
> nn1:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn2
> nn2:8020
>   
>   
> dfs.namenode.rpc-address.mycluster.nn3
> nn3:8020
>   
>
>   
> dfs.namenode.http-address.mycluster.nn1
> nn1:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn2
> nn2:9870
>   
>   
> dfs.namenode.http-address.mycluster.nn3
> nn3:9870
>   
>
>   
> dfs.namenode.shared.edits.dir
> qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster
>   
>   
> dfs.client.failover.proxy.provider.mycluster
> 
> org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
>   
>
>   
> dfs.ha.fencing.methods
> sshfence
>   
>
>   
> dfs.ha.fencing.ssh.private-key-files
> /home/harry/.ssh/id_rsa
>   
>
>   
> dfs.namenode.name.dir
> file:/hadoop/data/hdfs/namenode
>   
>   
> dfs.datanode.data.dir
> file:/hadoop/data/hdfs/datanode
>   
>   
> dfs.journalnode.edits.dir
> /hadoop/data/hdfs/journalnode
>   
>   
> dfs.namenode.rpc-address
> nn1:8020
>   
>
>   
> dfs.ha.nn.not-become-active-in-safemode
> true
>   
>
> 
>
>
>
> In core-site.xml I have this
>
> 
>
> 
>
> 
>
>
> 
>
>
> 
>
>   
>
> fs.defaultFS
>
> hdfs://nn1:8020
>
>   
>
>
> 
>
>
>
> On Tuesday, October 3, 2023 at 12:54:26 AM PDT, Liming Cui 
>  wrote:
>
>
> Can you show us the configuration files?
> Maybe I can help you with some suggestions.
>
>
> On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison 
>  wrote:
>
> I am trying to setup a HA HDFS cluster, and I am running into a problem
>
> I am not sure what I am doing wrong, I thought I followed the HA namenode 
> guide, but it is not working.
>
>
> Apache Hadoop 3.3.6 – HDFS High Availability
>
>
>
> I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
>
> After some period of time I see the following and my namenode and journal 
> node die.
> I am not sure where the problem is, or how to diagnose what I am doing wrong 
> here.  And the logging here does not make sense to me.
>
> Namenode
> Serving checkpoints at http://nn1:9870 
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
> [2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM 
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159
>
> / 
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
>
> JournalNode
> [2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no edit 
> logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority   

Re: HDFS HA namenode issue

2023-10-03 Thread Harry Jamison
 Liming,
After looking at my config, I think that maybe my problem is because my 
fs.defaultFS is inconsistent between hdfs-site.xml and core-site.xml.
What do hdfs-site.xml vs core-site.xml do, and why is the same setting in 2 
different places? Or do I just have it there mistakenly?
This is what I have in hdfs-site.xml:

<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://mycluster</value></property>
  <property><name>ha.zookeeper.quorum</name><value>nn1:2181,nn2:2181,nn3:2181</value></property>
  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2,nn3</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn3</name><value>nn3:8020</value></property>
  <property><name>dfs.namenode.http-address.mycluster.nn1</name><value>nn1:9870</value></property>
  <property><name>dfs.namenode.http-address.mycluster.nn2</name><value>nn2:9870</value></property>
  <property><name>dfs.namenode.http-address.mycluster.nn3</name><value>nn3:9870</value></property>
  <property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://nn1:8485;nn2:8485;nn3:8485/mycluster</value></property>
  <property><name>dfs.client.failover.proxy.provider.mycluster</name><value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value></property>
  <property><name>dfs.ha.fencing.methods</name><value>sshfence</value></property>
  <property><name>dfs.ha.fencing.ssh.private-key-files</name><value>/home/harry/.ssh/id_rsa</value></property>
  <property><name>dfs.namenode.name.dir</name><value>file:/hadoop/data/hdfs/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:/hadoop/data/hdfs/datanode</value></property>
  <property><name>dfs.journalnode.edits.dir</name><value>/hadoop/data/hdfs/journalnode</value></property>
  <property><name>dfs.namenode.rpc-address</name><value>nn1:8020</value></property>
  <property><name>dfs.ha.nn.not-become-active-in-safemode</name><value>true</value></property>
</configuration>


In core-site.xml I have this:

<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://nn1:8020</value></property>
</configuration>







On Tuesday, October 3, 2023 at 12:54:26 AM PDT, Liming Cui 
 wrote:  
 
 Can you show us the configuration files? Maybe I can help you with some 
suggestions.

On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison  
wrote:

I am trying to set up an HA HDFS cluster, and I am running into a problem.
I am not sure what I am doing wrong; I thought I followed the HA namenode 
guide, but it is not working.

Apache Hadoop 3.3.6 – HDFS High Availability


I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
After some period of time I see the following and my namenode and journal node 
die. I am not sure where the problem is, or how to diagnose what I am doing 
wrong here.  And the logging here does not make sense to me.

Namenode
Serving checkpoints at http://nn1:9870 
(org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
real-time non-blocking time  (microseconds, -R) unlimited

core file size              (blocks, -c) 0

data seg size               (kbytes, -d) unlimited

scheduling priority                 (-e) 0

file size                   (blocks, -f) unlimited

pending signals                     (-i) 15187

max locked memory           (kbytes, -l) 8192

max memory size             (kbytes, -m) unlimited

open files                          (-n) 1024

pipe size                (512 bytes, -p) 8

POSIX message queues         (bytes, -q) 819200

real-time priority                  (-r) 0

stack size                  (kbytes, -s) 8192

cpu time                   (seconds, -t) unlimited

max user processes                  (-u) 15187

virtual memory              (kbytes, -v) unlimited

file locks                          (-x) unlimited

[2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM 
(org.apache.hadoop.hdfs.server.namenode.NameNode)

[2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG: 

/

SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159

/ 
(org.apache.hadoop.hdfs.server.namenode.NameNode)

JournalNode
[2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has 
no edit logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
real-time non-blocking time  (microseconds, -R) unlimited

core file size              (blocks, -c) 0

data seg size               (kbytes, -d) unlimited

scheduling priority                 (-e) 0

file size                   (blocks, -f) unlimited

pending signals                     (-i) 15187

max locked memory           (kbytes, -l) 8192

max memory size             (kbytes, -m) unlimited

open files                          (-n) 1024

pipe size                (512 bytes, -p) 8

POSIX message queues         (bytes, -q) 819200

real-time priority                  (-r) 0

stack size                  (kbytes, -s) 8192

cpu time                   (seconds, -t) unlimited

max user processes                  (-u) 15187

virtual memory              (kbytes, -v) unlimited

file locks                          (-x) unlimited





-- 
Best
Liming  

Re: HDFS HA namenode issue

2023-10-03 Thread Liming Cui
Can you show us the configuration files?
Maybe I can help you with some suggestions.


On Tue, Oct 3, 2023 at 9:05 AM Harry Jamison
 wrote:

> I am trying to setup a HA HDFS cluster, and I am running into a problem
>
> I am not sure what I am doing wrong, I thought I followed the HA namenode
> guide, but it is not working.
>
>
> Apache Hadoop 3.3.6 – HDFS High Availability
> 
>
>
>
> I have 2 namenodes and 3 journal nodes, and 3 zookeeper nodes.
>
> After some period of time I see the following and my namenode and journal
> node die.
> I am not sure where the problem is, or how to diagnose what I am doing
> wrong here.  And the logging here does not make sense to me.
>
> Namenode
> Serving checkpoints at http://nn1:9870
> (org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
> [2023-10-02 23:53:46,693] ERROR RECEIVED SIGNAL 15: SIGTERM
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> [2023-10-02 23:53:46,701] INFO SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at nn1/192.168.1.159
>
> /
> (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> JournalNode
> [2023-10-02 23:54:19,162] WARN Journal at nn1/192.168.1.159:8485 has no
> edit logs (org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer)
>
> real-time non-blocking time  (microseconds, -R) unlimited
>
> core file size  (blocks, -c) 0
>
> data seg size   (kbytes, -d) unlimited
>
> scheduling priority (-e) 0
>
> file size   (blocks, -f) unlimited
>
> pending signals (-i) 15187
>
> max locked memory   (kbytes, -l) 8192
>
> max memory size (kbytes, -m) unlimited
>
> open files  (-n) 1024
>
> pipe size(512 bytes, -p) 8
>
> POSIX message queues (bytes, -q) 819200
>
> real-time priority  (-r) 0
>
> stack size  (kbytes, -s) 8192
>
> cpu time   (seconds, -t) unlimited
>
> max user processes  (-u) 15187
>
> virtual memory  (kbytes, -v) unlimited
>
> file locks  (-x) unlimited
>
>
>

-- 
*Best*

Liming