Dear HBase Community,
We recently experienced an intermittent issue in our HBase cluster (HBase
1.4.14, HDFS 2.7.3, ZooKeeper 3.4.10, 9 region servers, 2 masters).
Issue Details:
- Symptoms: Get operations intermittently return null for certain row
keys even though the data is present in the table.
- Duration: The issue persisted for two days and resolved on its own
without intervention.
- Timeline:
  - Event: A full GC occurred on a region server.
  - Action: We restarted that region server, which led to region
    assignment issues.
  - Troubleshooting: We ran hbck -fixAssignments, but the issues
    persisted. We eventually restarted all region servers, which
    stabilized the cluster.
  - Post-stabilization: For two days, random Get queries returned null,
    with no exceptions in the region server, master, ZooKeeper, or
    client logs.
  - Resolution: The issue resolved itself after two days.
Logs Reviewed:
- Searched for keywords "WAL", "HLog", "flush", "replay", "corruption"
in region server and master logs.
- Checked ZooKeeper logs for connectivity issues.
Questions:
1. What could cause Gets to intermittently return null when the data is
present?
2. Are there specific WAL or region server configurations to check?
3. What additional logs or steps should we review?
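In case it helps frame the third question, here is a minimal sketch of a
repeated-read probe we could run if the problem recurs. It is only an
illustration, not something we ran during the incident. The class name
GetProbe, the method countMisses, and the lookup function are hypothetical;
the lookup stands in for whatever read path is under test, so the sketch
itself has no HBase dependency.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: counts how often a lookup returns null per row key. */
final class GetProbe {
    private GetProbe() {}

    /**
     * Calls the lookup `attempts` times for each row key and returns,
     * per key, how many of those calls came back null.
     * The lookup is an assumption: any function mapping a row key to a value.
     */
    static Map<String, Integer> countMisses(Function<String, byte[]> lookup,
                                            List<String> rowKeys,
                                            int attempts) {
        Map<String, Integer> misses = new LinkedHashMap<>();
        for (String rowKey : rowKeys) {
            int nulls = 0;
            for (int i = 0; i < attempts; i++) {
                if (lookup.apply(rowKey) == null) {
                    nulls++;
                }
            }
            misses.put(rowKey, nulls);
        }
        return misses;
    }
}
```

Against a real cluster, the lookup would presumably wrap a client call such
as table.get(new Get(Bytes.toBytes(rowKey))).getValue(family, qualifier),
which returns null when the cell is not found. A row key with a miss count
between zero and the number of attempts would confirm that the same key
reads inconsistently.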
Any guidance would be appreciated.
Regards, Roshan B