Harsh J created HDFS-4936:
-----------------------------

             Summary: Handle overflow condition for txid going over Long.MAX_VALUE
                 Key: HDFS-4936
                 URL: https://issues.apache.org/jira/browse/HDFS-4936
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.0.0-alpha
            Reporter: Harsh J
            Priority: Minor
Hat tip to [~fengdon...@gmail.com] for the question (on the mailing lists) that led to this. I hacked up my local NN's txids manually to go very large (close to the max) and tried out whether this causes any harm. I basically bumped up the freshly formatted files' starting txid to 9223372036854775805 (and ensured the image references the same by hex-editing it):

{code}
➜ current ls
VERSION  fsimage_9223372036854775805.md5  fsimage_9223372036854775805  seen_txid
➜ current cat seen_txid
9223372036854775805
{code}

The NameNode started up as expected:

{code}
13/06/25 18:30:08 INFO namenode.FSImage: Image file of size 129 loaded in 0 seconds.
13/06/25 18:30:08 INFO namenode.FSImage: Loaded image for txid 9223372036854775805 from /temp-space/tmp-default/dfs-cdh4/name/current/fsimage_9223372036854775805
13/06/25 18:30:08 INFO namenode.FSEditLog: Starting log segment at 9223372036854775806
{code}

I could create files and do regular ops, with the txid counting well past the long max. I created over 10 files, just to push it well over Long.MAX_VALUE. Stopping and restarting the NameNode fails, though, with the following error:

{code}
13/06/25 18:31:08 INFO namenode.FileJournalManager: Recovering unfinalized segments in /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current
13/06/25 18:31:08 INFO namenode.FileJournalManager: Finalizing edits file /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_inprogress_9223372036854775806 -> /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_9223372036854775806-9223372036854775807
13/06/25 18:31:08 FATAL namenode.NameNode: Exception in namenode join
java.io.IOException: Gap in transactions.
Expected to be able to read up until at least txid 9223372036854775806 but unable to find any edit logs containing txid -9223372036854775808
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkForGaps(FSEditLog.java:1194)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1152)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:616)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:267)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:592)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:435)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:397)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:399)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:433)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:609)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:590)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1141)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1205)
{code}

It looks like we also lose some edits on restart, as the finalized edits filename shows:

{code}
VERSION
edits_9223372036854775806-9223372036854775807
fsimage_9223372036854775805
fsimage_9223372036854775805.md5
seen_txid
{code}

It seems we won't be able to handle the case where the txid overflows. It's a very large number, so this is not an immediate concern, but it seemed worthy of a report.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
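As a side note, the negative txid in the error above is exactly what Java's silent {{long}} wraparound produces: incrementing past Long.MAX_VALUE yields Long.MIN_VALUE (-9223372036854775808). A minimal standalone sketch (not HDFS code; the {{Math.addExact}} guard is just one hypothetical way an overflow could be detected instead of wrapping):

```java
public class TxidOverflowDemo {
    public static void main(String[] args) {
        // Starting txid from the report above: Long.MAX_VALUE - 2.
        long txid = 9223372036854775805L;

        // Each increment is a plain long addition, which wraps silently.
        txid += 1; // 9223372036854775806 (first segment's start txid)
        txid += 1; // 9223372036854775807 == Long.MAX_VALUE
        txid += 1; // wraps around
        System.out.println(txid); // prints -9223372036854775808 == Long.MIN_VALUE

        // Hypothetical guard: Math.addExact throws instead of wrapping.
        try {
            Math.addExact(Long.MAX_VALUE, 1L);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```

This matches the report: the in-progress segment starting at 9223372036854775806 could hold only two more valid txids before the counter went negative, which is why recovery then looks for txid -9223372036854775808 and fails the gap check.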