Re: 0.18.1 datanode psuedo deadlock problem

Jason Venner Fri, 09 Jan 2009 20:59:36 -0800

I propose an alternate solution for this.

If the block information were managed by an inotify task (on linux/solaris), and the windows equivalent, which I forget, the datanode could be informed each time a file in the dfs tree is created, updated, or deleted.

With this information being delivered, it can maintain an accurate block map with only 1 full scan of the datanode blocks, at start time.

With this algorithm the data nodes will be able to scale to a much larger number of blocks.
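A minimal sketch of the idea, assuming a hypothetical class EventDrivenBlockMap that is not part of Hadoop. It does one recursive scan at start time and is then updated only by create/delete callbacks, which a notification layer (inotify through JNA, for example) would invoke. The notification wiring itself is left out, and the blk_ / .meta naming check is an assumption about how block files are named on disk.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: a block map kept current by file events, not rescans. */
class EventDrivenBlockMap {
    // Names of block files currently known to exist; safe for concurrent use.
    private final Set<String> blocks = ConcurrentHashMap.newKeySet();

    /** The single full scan, done once at start time. */
    void initialScan(File dir) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or an I/O error while listing
        }
        for (File f : children) {
            if (f.isDirectory()) {
                initialScan(f);
            } else if (isBlockFile(f.getName())) {
                blocks.add(f.getName());
            }
        }
    }

    /** Called by the (assumed) notification layer when a file is created. */
    void onCreated(String fileName) {
        if (isBlockFile(fileName)) {
            blocks.add(fileName);
        }
    }

    /** Called by the (assumed) notification layer when a file is deleted. */
    void onDeleted(String fileName) {
        blocks.remove(fileName);
    }

    /** Block report served from memory, with no stat per block. */
    List<String> blockReport() {
        List<String> report = new ArrayList<>(blocks);
        Collections.sort(report);
        return report;
    }

    // Assumption: block data files start with "blk_" and metadata ends in ".meta".
    static boolean isBlockFile(String name) {
        return name.startsWith("blk_") && !name.endsWith(".meta");
    }
}
```

With something like this, the periodic block report reads the in-memory set and never walks the directory tree, so its cost no longer depends on file system latency.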

The other thing is that the way the sync blocks on FSDataset.FSVolumeSet are held totally aggravates this bug in 0.18.1.

I have implemented a pure java version of inotify, using JNA (https://jna.dev.java.net/), and there is a windows version also available, or some simple jni could be written.

The ja...@attributor.com address will be going away shortly, I will be switching to jason.had...@gmail.com in the next little bit.




Jason Venner wrote:

The problem we are having is that datanodes periodically stall for 10-15 minutes and drop off the active list and then come back.
What is going on is that a long operation set is holding the lock on FSDataset.volumes, and all of the other block service requests stall behind this lock.
"DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10 tid=0x4d7ad400 nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
  java.lang.Thread.State: RUNNABLE
   at java.lang.String.lastIndexOf(String.java:1628)
   at java.io.File.getName(File.java:399)
   at org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148)
   at org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
   at org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
   at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511)
   - locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
   at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
   at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
   at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
   at java.lang.Thread.run(Thread.java:619)
This is basically taking a stat on every hdfs block on the datanode, which in our case is ~2 million, and can take 10+ minutes (we may be experiencing problems with our raid controller but have no visibility into it). At the OS level the file system seems fine and operations eventually finish.
It appears that a couple of different data structures are being locked with the single object FSDataset$Volume.
Then this happens:
"org.apache.hadoop.dfs.datanode$dataxcei...@1bcee17" daemon prio=10 tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry [0x459fe000..0x459ff0d0]
  java.lang.Thread.State: BLOCKED (on object monitor)
   at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473)
   - waiting to lock <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
   at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
   - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
   at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
   at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
   at java.lang.Thread.run(Thread.java:619)

which locks the FSDataset while waiting on the volume object
and now all of the Datanode operations stall waiting on the FSDataset object.
----------
Our particular installation doesn't use multiple directories for hdfs, so a first simple hack for a local fix would be to modify getNextVolume to just return the single volume and not be synchronized.
A richer alternative would be to make the locking more fine grained on FSDataset$FSVolumeSet.
Of course we are also trying to fix the file system performance and dfs block loading that results in the block report taking a long time.
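One way the finer-grained locking could look, as a sketch only: the class SnapshotVolumeSet and its methods are hypothetical and do not mirror the real FSDataset$FSVolumeSet code. The long scan copies the volume list while holding the lock and does the slow per-volume work after releasing it, so getNextVolume only ever waits for a few instructions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: short critical sections instead of one long one. */
class SnapshotVolumeSet {
    private final Object lock = new Object();
    private final List<String> volumes = new ArrayList<>();
    private int next = 0;

    SnapshotVolumeSet(List<String> initial) {
        if (initial.isEmpty()) {
            throw new IllegalArgumentException("at least one volume is required");
        }
        volumes.addAll(initial);
    }

    /** Round-robin choice; the lock is held only while picking the volume. */
    String getNextVolume() {
        synchronized (lock) {
            String v = volumes.get(next);
            next = (next + 1) % volumes.size();
            return v;
        }
    }

    /** Long scan: snapshot under the lock, then do the slow work outside it. */
    void forEachVolume(Consumer<String> slowScan) {
        List<String> snapshot;
        synchronized (lock) {
            snapshot = new ArrayList<>(volumes);
        }
        for (String v : snapshot) {
            slowScan.accept(v); // may take minutes; no lock is held here
        }
    }
}
```

The trade-off is that the scan sees the volume list as it was when the snapshot was taken, so a volume added during the scan is only picked up by the next one.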
Any suggestions or warnings?

Thanks.
