[ https://issues.apache.org/jira/browse/HDFS-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029201#comment-14029201 ]
Tony Reix commented on HDFS-6515:
---------------------------------

More data:

The "Failed to cache..." message is issued because a method returns -1. This reserve() method, from the UsedBytesCount class in FsDatasetCache.java, is very simple:

    long reserve(long count) {
      count = rounder.round(count);
      while (true) {
        long cur = usedBytes.get();
        long next = cur + count;
        if (next > maxBytes) {
          return -1;
        }
        if (usedBytes.compareAndSet(cur, next)) {
          return next;
        }
      }
    }

Adding some traces before and after the "if (next > maxBytes)" test, here is what we see on x86_64:

@@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 4096 | next: 8192 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 8192 | next: 12288 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 12288 | next: 16384 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 16384 | next: 20480 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536
........ about 30 more lines
@@reserve:: count: 4096 | cur: 65536 | next: 69632 | maxBytes: 65536
<-------- ERROR -1
<-------- ERROR -1
Failed to cache 1073741841_BP-798842898-127.0.1.1-1402491181446: could not reserve 4096 more bytes in the cache: dfs.datanode.max.locked.memory of 65536 exceeded.
Failed to cache 1073741842_BP-798842898-127.0.1.1-1402491181446: could not reserve 4096 more bytes in the cache: dfs.datanode.max.locked.memory of 65536 exceeded.

and on PPC64:

@@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 4096 | next: 8192 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 8192 | next: 12288 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 8192 | next: 12288 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 12288 | next: 16384 | maxBytes: 65536
@@reserve:: count: 4096 | cur: 16384 | next: 20480 | maxBytes: 65536
@@reserve:: count: 65536 | cur: 0 | next: 65536 | maxBytes: 65536
@@reserve:: count: 65536 | cur: 65536 | next: 131072 | maxBytes: 65536
<-------- ERROR -1
@@reserve:: count: 65536 | cur: 0 | next: 65536 | maxBytes: 65536
@@reserve:: count: 65536 | cur: 0 | next: 65536 | maxBytes: 65536
<-------- ERROR -1
Failed to cache 1073741825_BP-1511666005-127.0.1.1-1402063542502: could not reserve 65536 more bytes in the cache: dfs.datanode.max.locked.memory of 65536 exceeded.

This requires much more investigation. I'm now trying to understand what the HDFS cache mechanism is supposed to do.

=================================

I noticed:

x86_64: @@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536
PPC64 : @@reserve:: count: 65536 | cur: 0 | next: 65536 | maxBytes: 65536

count does not have the same value.

Looking at the CachingTask.run() method:

    public void run() {
      boolean success = false;
      FileInputStream blockIn = null, metaIn = null;
      MappableBlock mappableBlock = null;
      ExtendedBlock extBlk = new ExtendedBlock(key.getBlockPoolId(),
          key.getBlockId(), length, genstamp);
      LOG.warn("@@run:: key: " + key + " | blockFileName: " + blockFileName
          + " | length: " + length + " | genstamp: " + genstamp);
      long newUsedBytes = usedBytesCount.reserve(length);
      ...

and adding traces, I clearly see that, on PPC64, length changes from 512 to 65536, passing through 4096 and 16384, whereas it is always 512 or 4096 on x86_64.
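A possible explanation for the differing count values, assuming the rounder rounds each reservation up to the operating-system page size (typically 4096 bytes on x86_64 Linux and 65536 bytes on PPC64 Linux): a 512-byte block would then reserve 4096 bytes on x86_64 but 65536 bytes on PPC64, which is exactly maxBytes, so a single cached block would fill the whole cache. A minimal sketch of such rounding (illustration only; class and field names are hypothetical, not the actual Hadoop PageRounder code):

    public class PageRounderSketch {
        // Assumed OS page sizes: 4096 on x86_64, 65536 on PPC64.
        private final long osPageSize;

        PageRounderSketch(long osPageSize) {
            this.osPageSize = osPageSize;
        }

        // Round count up to the next multiple of the OS page size.
        long round(long count) {
            return ((count + osPageSize - 1) / osPageSize) * osPageSize;
        }

        public static void main(String[] args) {
            System.out.println(new PageRounderSketch(4096L).round(512));   // 4096  (x86_64)
            System.out.println(new PageRounderSketch(65536L).round(512));  // 65536 (PPC64) == maxBytes
        }
    }

The per-block length traces are below.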
PPC64:

/finalized/blk_1073741825 | length: 512 | genstamp: 1001
/finalized/blk_1073741826 | length: 512 | genstamp: 1002
/finalized/blk_1073741828 | length: 512 | genstamp: 1004
/finalized/blk_1073741827 | length: 512 | genstamp: 1003
/finalized/blk_1073741829 | length: 512 | genstamp: 1005
/finalized/blk_1073741826 | length: 65536 | genstamp: 1002
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741825 | length: 16384 | genstamp: 1001
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001

x86_64:

/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005
/finalized/blk_1073741830 | length: 4096 | genstamp: 1006
/finalized/blk_1073741831 | length: 4096 | genstamp: 1007
/finalized/blk_1073741832 | length: 4096 | genstamp: 1008
/finalized/blk_1073741833 | length: 4096 | genstamp: 1009
/finalized/blk_1073741834 | length: 4096 | genstamp: 1010
/finalized/blk_1073741835 | length: 4096 | genstamp: 1011
/finalized/blk_1073741836 | length: 4096 | genstamp: 1012
/finalized/blk_1073741837 | length: 4096 | genstamp: 1013
/finalized/blk_1073741838 | length: 4096 | genstamp: 1014
/finalized/blk_1073741839 | length: 4096 | genstamp: 1015
/finalized/blk_1073741840 | length: 4096 | genstamp: 1016
/finalized/blk_1073741841 | length: 4096 | genstamp: 1017
/finalized/blk_1073741842 | length: 4096 | genstamp: 1018
/finalized/blk_1073741843 | length: 4096 | genstamp: 1019
/finalized/blk_1073741844 | length: 4096 | genstamp: 1020
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005
/finalized/blk_1073741825 | length: 512 | genstamp: 1001
/finalized/blk_1073741826 | length: 512 | genstamp: 1002
/finalized/blk_1073741828 | length: 512 | genstamp: 1004
/finalized/blk_1073741827 | length: 512 | genstamp: 1003
/finalized/blk_1073741829 | length: 512 | genstamp: 1005
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005
/finalized/blk_1073741831 | length: 4096 | genstamp: 1007
/finalized/blk_1073741833 | length: 4096 | genstamp: 1009
/finalized/blk_1073741835 | length: 4096 | genstamp: 1011
/finalized/blk_1073741837 | length: 4096 | genstamp: 1013
/finalized/blk_1073741839 | length: 4096 | genstamp: 1015
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004
/finalized/blk_1073741830 | length: 4096 | genstamp: 1006
/finalized/blk_1073741841 | length: 4096 | genstamp: 1017
/finalized/blk_1073741832 | length: 4096 | genstamp: 1008
/finalized/blk_1073741834 | length: 4096 | genstamp: 1010
/finalized/blk_1073741836 | length: 4096 | genstamp: 1012
/finalized/blk_1073741838 | length: 4096 | genstamp: 1014
/finalized/blk_1073741840 | length: 4096 | genstamp: 1016
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001

=================================================================

Here are the main classes and methods dealing with getting the length:

    public class Block implements Writable, Comparable<Block> {
      ...
      public long getNumBytes() {
        return numBytes;
      }
      ...
    }

    public interface Replica {
      ...
    }

    abstract public class ReplicaInfo extends Block implements Replica {
      ...
    }

    public class FinalizedReplica extends ReplicaInfo {
      @Override
      public long getVisibleLength() {
        return getNumBytes();       // all bytes are visible
      }
    }

    class ReplicaMap {
      ...
      ReplicaInfo get(String bpid, long blockId) {
        checkBlockPool(bpid);
        synchronized (mutex) {
          Map<Long, ReplicaInfo> m = map.get(bpid);
          return m != null ? m.get(blockId) : null;
        }
      }
      ...
    }

    class FsDatasetImpl implements FsDatasetSpi<FsVolumeImpl> {
      final ReplicaMap volumeMap;

      private void cacheBlock(String bpid, long blockId) {
        ...
        long length, genstamp;
        ...
        ReplicaInfo info = volumeMap.get(bpid, blockId);
        ...
        length = info.getVisibleLength();
        ...
        cacheManager.cacheBlock(blockId, bpid, blockFileName, length, genstamp, volumeExecutor);
        ...
      }
    }

    public class FsDatasetCache {
      ...
      cacheBlock(long blockId, String bpid, String blockFileName, long length,
          long genstamp, Executor volumeExecutor) {
        ...
        volumeExecutor.execute(new CachingTask(key, blockFileName, length, genstamp));
      }

      private class CachingTask implements Runnable {
        ...
        private final long length;
        ...
        CachingTask(ExtendedBlockId key, String blockFileName, long length, long genstamp) {
          ...
          this.length = length;
          ...
        }

        @Override
        public void run() {
          ...
          ExtendedBlock extBlk = new ExtendedBlock(key.getBlockPoolId(),
              key.getBlockId(), length, genstamp);
          ...

(A simplified, hypothetical mock of this length path is sketched at the end of this message.)

> testPageRounder (org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-6515
>                 URL: https://issues.apache.org/jira/browse/HDFS-6515
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.4.0
>         Environment: Linux on PPC64
>            Reporter: Tony Reix
>            Priority: Blocker
>              Labels: test
>
> I have an issue with the test:
>     testPageRounder (org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)
> on Linux/PowerPC.
> On Linux/Intel, the test runs fine.
> On Linux/PowerPC, I have:
> testPageRounder(org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)  Time elapsed: 64.037 sec  <<< ERROR!
> java.lang.Exception: test timed out after 60000 milliseconds
> Looking at details, I see that some "Failed to cache" messages appear in the traces: only 10 on Intel, but 186 on PPC64.
> On PPC64, it looks like some thread is waiting for something that never happens, generating a timeout.
> I'm now using the IBM JVM; however, I've just checked that the issue also appears with OpenJDK.
> I'm now using the latest Hadoop; however, the issue appeared with Hadoop 2.4.0.
> I need help understanding what the test is doing and what traces are expected, in order to understand what/where the root cause is.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
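For reference (as mentioned above), here is a minimal, hypothetical mock of the length path excerpted in the class walkthrough: the value returned by getVisibleLength() is read once in FsDatasetImpl.cacheBlock() and then captured by the CachingTask, so whatever the replica reports at that instant is what later reaches reserve(). The class and method names below are simplified stand-ins, not the actual Hadoop code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executor;

    public class LengthFlowSketch {

        // Stand-in for a finalized replica: numBytes is whatever the replica
        // currently reports; getVisibleLength() simply returns it.
        static class Replica {
            volatile long numBytes;
            Replica(long numBytes) { this.numBytes = numBytes; }
            long getVisibleLength() { return numBytes; }
        }

        // Stand-in for FsDatasetImpl.volumeMap.
        static final Map<Long, Replica> volumeMap = new ConcurrentHashMap<>();

        // Stand-in for the cacheBlock() path: the length is read here, once,
        // and frozen into the task that will later call reserve(length).
        static void cacheBlock(long blockId, Executor volumeExecutor) {
            Replica info = volumeMap.get(blockId);
            long length = info.getVisibleLength();
            volumeExecutor.execute(new CachingTask(blockId, length));
        }

        // Stand-in for FsDatasetCache.CachingTask.
        static class CachingTask implements Runnable {
            final long blockId;
            final long length;   // captured at construction time
            CachingTask(long blockId, long length) {
                this.blockId = blockId;
                this.length = length;
            }
            @Override
            public void run() {
                // The real code would call usedBytesCount.reserve(length) here.
                System.out.println("reserve(" + length + ") for block " + blockId);
            }
        }

        public static void main(String[] args) {
            volumeMap.put(1073741825L, new Replica(512));
            cacheBlock(1073741825L, Runnable::run);   // prints: reserve(512) for block 1073741825
        }
    }

If the replica's reported length changes between the moment the task is created and the moment the data is actually cached (or between successive cache attempts), the reserved size will vary from attempt to attempt, which would match the 512 / 4096 / 16384 / 65536 values observed above on PPC64.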