Thanks, Todd.
We had finally started suspecting that angle as well, and planned to capture
the file details before the reboot and after the reboot. With that comparison
I can confirm whether or not it is the same issue.
One more thing to note: the gap between the reboot time and the last replica
finalization is ~1hr in some cases.
Since the machine was rebooted because of kernel.hung_task_timeout_secs, at
the OS level that particular thread may never have had the chance to sync the
data to disk.
HDFS-1539 is a great one; I have merged all the bug fixes, but since that
issue is marked as an improvement it might not have come up on my list :(.
I also found an OS-level mount option for doing the filesystem operations
synchronously:

  dirsync
    All directory updates within the filesystem should be done
    synchronously. This affects the following system calls: creat, link,
    unlink, symlink, mkdir, rmdir, mknod and rename.
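The application-level alternative to mounting with dirsync is to fsync the
directory itself after a metadata update. A minimal Java sketch of that
pattern (just an illustration, not DataNode code; it assumes a Linux
filesystem, where a directory can be opened read-only and force()d):

  import java.io.IOException;
  import java.nio.channels.FileChannel;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.nio.file.StandardOpenOption;

  public class DirSyncSketch {
      // Flush a directory's entries (creates, unlinks, renames) to disk.
      // Works on Linux; some platforms refuse to open a directory this way.
      static void fsyncDirectory(Path dir) throws IOException {
          try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
              ch.force(true);
          }
      }

      public static void main(String[] args) throws IOException {
          fsyncDirectory(Paths.get(args[0]));
      }
  }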
We mainly suspect that a rename operation was lost after the reboot, since the
meta file and block file renames happen when finalizing a block from
blocksBeingWritten (BBW) to current (at least, leaving the block size aside).
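To show the window we are worried about, here is a rough sketch of that
finalize step (the directory layout and file names are made up for
illustration, not the actual DataNode code): unless the directories are
synced (or the mount uses dirsync), the renames can still be sitting only in
the filesystem journal when the box goes down.

  import java.io.IOException;
  import java.nio.channels.FileChannel;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.StandardCopyOption;
  import java.nio.file.StandardOpenOption;

  public class FinalizeRenameSketch {
      // Move a block file and its meta file from the BBW directory to the
      // current directory, then fsync both directories so the renames are
      // on disk before the replica is reported as finalized.
      static void finalizeBlock(Path bbwDir, Path currentDir,
                                String blockFile, String metaFile) throws IOException {
          Files.move(bbwDir.resolve(blockFile), currentDir.resolve(blockFile),
                     StandardCopyOption.ATOMIC_MOVE);
          Files.move(bbwDir.resolve(metaFile), currentDir.resolve(metaFile),
                     StandardCopyOption.ATOMIC_MOVE);
          // Without these syncs the new directory entries can be lost if the
          // machine reboots before the journal commits them.
          fsyncDirectory(currentDir);
          fsyncDirectory(bbwDir);
      }

      static void fsyncDirectory(Path dir) throws IOException {
          try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
              ch.force(true);
          }
      }
  }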
Anyway, thanks a lot for your great and valuable time with us here. After
checking the above OS logs, I will do a run with HDFS-1539.
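For that run, the switch you mentioned is the dfs.datanode.synconclose
property; I would set it in hdfs-site.xml on each DataNode. Purely as an
illustration with the standard Hadoop Configuration API, it is just a boolean
property:

  import org.apache.hadoop.conf.Configuration;

  public class SyncOnCloseConfig {
      public static void main(String[] args) {
          // dfs.datanode.synconclose (HDFS-1539): have the DataNode sync
          // block data to disk when a block is closed. Defaults to false.
          Configuration conf = new Configuration();
          conf.setBoolean("dfs.datanode.synconclose", true);
          System.out.println("dfs.datanode.synconclose = "
              + conf.getBoolean("dfs.datanode.synconclose", false));
      }
  }

That, together with the dirsync check, should cover both the HDFS side and
the OS side of the rename durability.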
Regards,
Uma
________________________________________
From: Todd Lipcon [[email protected]]
Sent: Thursday, November 24, 2011 5:07 AM
To: [email protected]
Cc: [email protected]
Subject: Re: Blocks are getting corrupted under very high load
On Wed, Nov 23, 2011 at 1:23 AM, Uma Maheswara Rao G
<[email protected]> wrote:
> Yes, Todd, the block after restart is smaller and the genstamp is also lower.
> Here a complete machine reboot happened. The boards are configured such that,
> if one is not getting any CPU cycles for 480 secs, it will reboot itself:
> kernel.hung_task_timeout_secs = 480 sec.
So sounds like the following happened:
- while writing file, the pipeline got reduced down to 1 node due to
timeouts from the other two
- soon thereafter (before more replicas were made), that last replica
kernel-panicked without syncing the data
- on reboot, the filesystem lost some edits from its ext3 journal, and
the block got moved back into the RBW directory, with truncated data
- hdfs did "the right thing" - at least what the algorithms say it
should do, because it had gotten a commitment for a later replica
If you have a build which includes HDFS-1539, you could consider
setting dfs.datanode.synconclose to true, which would have prevented
this problem.
-Todd
--
Todd Lipcon
Software Engineer, Cloudera