I propose an alternate solution for this.
If the block information were managed by an inotify task (on
Linux/Solaris), or the Windows equivalent whose name I forget, the
datanode could be notified each time a file in the dfs tree is created,
updated, or deleted.
With that information delivered as events, the datanode could maintain
an accurate block map with only one full scan of its blocks, at start
time.
With this approach the datanodes would be able to scale to a much
larger number of blocks.
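To make the event-driven idea concrete, here is a rough sketch of a
block map kept current from file-system notifications. It uses the
JDK 7+ java.nio.file.WatchService rather than the JNA inotify binding
described below, and the class and field names (BlockMapWatcher,
blockMap) are purely illustrative, not actual DataNode code:

import java.io.IOException;
import java.nio.file.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: an in-memory map of block file name -> length,
// seeded by one full scan at start time and then kept current from events.
public class BlockMapWatcher implements Runnable {
    private final Path blockDir;
    private final Map<String, Long> blockMap = new ConcurrentHashMap<String, Long>();

    public BlockMapWatcher(Path blockDir) throws IOException {
        this.blockDir = blockDir;
        // The only full scan; after this, create/modify/delete events suffice.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(blockDir)) {
            for (Path p : stream) {
                blockMap.put(p.getFileName().toString(), Files.size(p));
            }
        }
    }

    public void run() {
        try (WatchService watcher = blockDir.getFileSystem().newWatchService()) {
            blockDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY,
                    StandardWatchEventKinds.ENTRY_DELETE);
            while (true) {
                WatchKey key = watcher.take();
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // a real implementation would rescan here
                    }
                    String name = event.context().toString();
                    if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                        blockMap.remove(name);
                    } else {
                        blockMap.put(name, Files.size(blockDir.resolve(name)));
                    }
                }
                key.reset();
            }
        } catch (IOException e) {
            // error handling omitted in this sketch
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

A block report could then be generated from blockMap without any
on-disk walk at report time.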
The other issue is that the way the synchronized blocks on
FSDataset.FSVolumeSet are held greatly aggravates this bug in 0.18.1.
I have implemented a pure Java version of inotify using JNA
(https://jna.dev.java.net/); a Windows version is also available, or
some simple JNI could be written.
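For reference, the libc side of such a JNA binding could be declared
roughly as below; this is Linux-only, and the interface name and
constant values are written out here for illustration rather than taken
from the code mentioned above:

import com.sun.jna.Library;
import com.sun.jna.Native;

// Hypothetical minimal libc binding for inotify via JNA (Linux only).
public interface LibC extends Library {
    LibC INSTANCE = (LibC) Native.loadLibrary("c", LibC.class);

    int IN_CREATE = 0x00000100;  // event masks from <sys/inotify.h>
    int IN_DELETE = 0x00000200;
    int IN_MODIFY = 0x00000002;

    int inotify_init();
    int inotify_add_watch(int fd, String path, int mask);
    int inotify_rm_watch(int fd, int wd);
    int read(int fd, byte[] buf, int count);
    int close(int fd);
}

A watcher thread would call inotify_init(), register each block
directory with inotify_add_watch(fd, dir, IN_CREATE | IN_MODIFY |
IN_DELETE), and then decode the packed inotify_event records that
read() returns.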
The ja...@attributor.com address will be going away shortly; I will be
switching to jason.had...@gmail.com in the next little bit.
Jason Venner wrote:
The problem we are having is that datanodes periodically stall for
10-15 minutes, drop off the active list, and then come back.
What is going on is that a long-running operation is holding the lock
on FSDataset.volumes, and all of the other block service requests
stall behind this lock.
"DataNode: [/data/dfs-video-18/dfs/data]" daemon prio=10
tid=0x4d7ad400 nid=0x7c40 runnable [0x4c698000..0x4c6990d0]
java.lang.Thread.State: RUNNABLE
at java.lang.String.lastIndexOf(String.java:1628)
at java.io.File.getName(File.java:399)
at
org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148)
at
org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
at
org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
at
org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511)
- locked <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
at java.lang.Thread.run(Thread.java:619)
This is basically taking a stat on every hdfs block on the datanode,
which in our case is ~2 million blocks, and it can take 10+ minutes
(we may be experiencing problems with our raid controller, but we have
no visibility into it); at the OS level the file system seems fine and
operations eventually finish.
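For a sense of scale, even a minimal per-file pass over a block
directory of that size is a metadata walk proportional to the number
of blocks, and getBlockReport performs it while holding the
FSVolumeSet lock. Something as simple as the following rough harness
(the directory argument and blk_ name filter are just for
illustration) takes minutes when the underlying device is slow:

import java.io.File;

// Rough timing of a full per-file pass over a block directory,
// comparable in cost to the scan getBlockReport does under the lock.
public class BlockScanTimer {
    public static void main(String[] args) {
        // e.g. java BlockScanTimer /data/dfs-video-18/dfs/data/current
        File dir = new File(args.length > 0 ? args[0] : ".");
        long start = System.currentTimeMillis();
        long blocks = 0;
        long bytes = 0;
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                String name = f.getName();
                if (name.startsWith("blk_") && !name.endsWith(".meta")) {
                    bytes += f.length();   // one stat per block file
                    blocks++;
                }
            }
        }
        System.out.println(blocks + " blocks, " + bytes + " bytes, "
                + (System.currentTimeMillis() - start) + " ms");
    }
}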
It appears that a couple of different data structures are being locked
via the single object FSDataset$FSVolumeSet.
Then this happens:
"org.apache.hadoop.dfs.datanode$dataxcei...@1bcee17" daemon prio=10
tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry
[0x459fe000..0x459ff0d0]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473)
        - waiting to lock <0x551e8d48> (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
        - locked <0x54e550e0> (a org.apache.hadoop.dfs.FSDataset)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2322)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
        at java.lang.Thread.run(Thread.java:619)
This thread locks the FSDataset while waiting on the FSVolumeSet
object, and now all of the DataNode operations stall waiting on the
FSDataset object.
----------
Our particular installation doesn't use multiple directories for hdfs,
so a first simple hack for a local fix would be to modify
getNextVolume to just return the single volume and not be synchronized.
A richer alternative would be to make the locking on
FSDataset$FSVolumeSet more fine-grained.
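As a rough illustration of what the finer-grained version could look
like (this is not the actual FSDataset code; FSVolume here is just a
stand-in): keep the volume array behind a volatile reference, let
getNextVolume pick round-robin without entering a shared monitor, and
have the block report walk its own snapshot of the array so the slow
scan never blocks writers:

import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for org.apache.hadoop.dfs.FSDataset$FSVolume.
interface FSVolume {
}

// Illustrative only: no lock is held across the long per-volume scan.
class FineGrainedVolumeSet {
    private volatile FSVolume[] volumes;      // replaced wholesale if volumes change
    private final AtomicInteger nextIndex = new AtomicInteger();

    FineGrainedVolumeSet(FSVolume[] volumes) {
        this.volumes = volumes.clone();       // assumes at least one volume
    }

    // Round-robin selection; never blocks behind a long-running block report.
    FSVolume getNextVolume() {
        FSVolume[] snapshot = volumes;
        int i = (nextIndex.getAndIncrement() & Integer.MAX_VALUE) % snapshot.length;
        return snapshot[i];
    }

    // The block report walks a private snapshot of the array, so the slow
    // per-block scan happens outside anything getNextVolume waits on.
    FSVolume[] snapshotForBlockReport() {
        return volumes.clone();
    }
}

Some synchronization is still needed wherever the array itself is
mutated, but the point is that the per-block scan no longer happens
inside the monitor that getNextVolume and the write path need.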
Of course we are also trying to fix the file system performance and
dfs block loading that results in the block report taking a long time.
Any suggestions or warnings?
Thanks.