[ https://issues.apache.org/jira/browse/HDFS-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029201#comment-14029201 ]

Tony Reix commented on HDFS-6515:
---------------------------------

More data:

The "Failed to cache... message" is issued due to a method returning -1.
This reserve() method from class UsedBytesCount from file FsDatasetCache.java 
is very simple :

    long reserve(long count) {
      count = rounder.round(count);            // round the requested size up (page rounding)
      while (true) {
        long cur = usedBytes.get();
        long next = cur + count;
        if (next > maxBytes) {
          return -1;                           // no room left: caller logs "Failed to cache ..."
        }
        if (usedBytes.compareAndSet(cur, next)) {
          return next;                         // reservation recorded
        }
        // compareAndSet lost a race with another thread: retry with the fresh value
      }
    }
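
As a side note on the rounder used in the first line: in both traces below, count
looks like the requested length rounded up to the next multiple of 4096
(512 -> 4096, 4096 -> 4096, 65536 -> 65536). A minimal round-up-to-multiple
helper, just to illustrate that arithmetic (my own sketch, not the HDFS
PageRounder itself):

public class RoundUpSketch {
  // Round count up to the next multiple of unit (unit must be > 0).
  static long roundUp(long count, long unit) {
    return ((count + unit - 1) / unit) * unit;
  }

  public static void main(String[] args) {
    long[] lengths = {512, 4096, 16384, 65536};   // block lengths seen in the traces
    for (long len : lengths) {
      // 4096 is the rounding unit that matches the count values observed below
      System.out.println("length " + len + " -> rounded count " + roundUp(len, 4096));
    }
  }
}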

Adding some traces before and after the "if (next > maxBytes)" test,

here is what we see on x86_64:

@@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 4096 | next: 8192 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 8192 | next: 12288 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 12288 | next: 16384 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 16384 | next: 20480 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536\0A
........ about 30 more lines
@@reserve:: count: 4096 | cur: 65536 | next: 69632 | maxBytes: 65536\0A
  <-------- ERROR -1
  <-------- ERROR -1
Failed to cache 1073741841_BP-798842898-127.0.1.1-1402491181446: could not 
reserve 4096 more bytes in the cache: dfs.datanode.max.locked.memory of 65536 
exceeded.
Failed to cache 1073741842_BP-798842898-127.0.1.1-1402491181446: could not 
reserve 4096 more bytes in the cache: dfs.datanode.max.locked.memory of 65536 
exceeded.

and on PPC64:

@@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 4096 | next: 8192 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 8192 | next: 12288 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 8192 | next: 12288 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 12288 | next: 16384 | maxBytes: 65536\0A
@@reserve:: count: 4096 | cur: 16384 | next: 20480 | maxBytes: 65536\0A
@@reserve:: count: 65536 | cur: 0 | next: 65536 | maxBytes: 65536\0A
@@reserve:: count: 65536 | cur: 65536 | next: 131072 | maxBytes: 65536\0A
  <-------- ERROR -1
@@reserve:: count: 65536 | cur: 0 | next: 65536 | maxBytes: 65536\0A
@@reserve:: count: 65536 | cur: 0 | next: 65536 | maxBytes: 65536\0A
  <-------- ERROR -1
Failed to cache 1073741825_BP-1511666005-127.0.1.1-1402063542502: could not 
reserve 65536 more bytes in the cache: dfs.datanode.max.locked.memory of 65536 
exceeded.
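
To make the failure pattern concrete, here is a small self-contained simulation
of the reservation arithmetic (my own sketch, reusing the reserve() logic quoted
above with a fixed count): with 4096-byte reservations, sixteen calls succeed
before the 65536-byte budget is exhausted, whereas with 65536-byte reservations
the very first call fills the budget and the second one returns -1, which
matches the PPC64 trace. (In the traces above, cur occasionally drops back to 0,
presumably because earlier reservations were released in between.)

import java.util.concurrent.atomic.AtomicLong;

public class ReserveSimulation {
  static final long MAX_BYTES = 65536;          // dfs.datanode.max.locked.memory in the test
  static final AtomicLong usedBytes = new AtomicLong(0);

  // Same loop as UsedBytesCount.reserve(), with the rounding already applied to count.
  static long reserve(long count) {
    while (true) {
      long cur = usedBytes.get();
      long next = cur + count;
      if (next > MAX_BYTES) {
        return -1;
      }
      if (usedBytes.compareAndSet(cur, next)) {
        return next;
      }
    }
  }

  public static void main(String[] args) {
    // x86_64-like pattern: 4096-byte reservations, the 17th one fails.
    for (int i = 1; i <= 17; i++) {
      System.out.println("4096-byte reservation #" + i + " -> " + reserve(4096));
    }
    usedBytes.set(0);
    // PPC64-like pattern: 65536-byte reservations, the 2nd one fails.
    for (int i = 1; i <= 2; i++) {
      System.out.println("65536-byte reservation #" + i + " -> " + reserve(65536));
    }
  }
}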

This requires much more investigation. I'm now trying to understand what the 
HDFS cache mechanism is supposed to do.

=================================

I noticed:
x86_64: @@reserve:: count: 4096 | cur: 0 | next: 4096 | maxBytes: 65536\0A
PPC64 : @@reserve:: count: 65536 | cur: 0 | next: 65536 | maxBytes: 65536\0A

count does not have the same value on the two platforms.

Looking at the CachingTask.run() method:
    public void run() {
      boolean success = false;
      FileInputStream blockIn = null, metaIn = null;
      MappableBlock mappableBlock = null;
      ExtendedBlock extBlk = new ExtendedBlock(key.getBlockPoolId(),
          key.getBlockId(), length, genstamp);
      // trace added for this investigation:
      LOG.warn("@@run:: key: " + key + " | blockFileName: " + blockFileName
          + " | length: " + length + " | genstamp: " + genstamp);
      long newUsedBytes = usedBytesCount.reserve(length);
and adding this trace, I clearly see that, on PPC64, length changes from 512 to
65536, passing through 4096 and 16384, whereas it is always 512 or 4096 on
x86_64 (the two dumps below; see also the small summary sketch after them).

PPC64:
/finalized/blk_1073741825 | length: 512 | genstamp: 1001
/finalized/blk_1073741826 | length: 512 | genstamp: 1002
/finalized/blk_1073741828 | length: 512 | genstamp: 1004
/finalized/blk_1073741827 | length: 512 | genstamp: 1003
/finalized/blk_1073741829 | length: 512 | genstamp: 1005
/finalized/blk_1073741826 | length: 65536 | genstamp: 1002
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001
/finalized/blk_1073741825 | length: 16384 | genstamp: 1001
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001
/finalized/blk_1073741825 | length: 65536 | genstamp: 1001

x86_64:
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001\0A
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002\0A
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003\0A
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004\0A
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005\0A
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001\0A
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001\0A
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002\0A
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002\0A
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003\0A
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003\0A
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004\0A
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004\0A
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005\0A
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005\0A
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001\0A
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002\0A
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003\0A
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004\0A
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005\0A
/finalized/blk_1073741830 | length: 4096 | genstamp: 1006\0A
/finalized/blk_1073741831 | length: 4096 | genstamp: 1007\0A
/finalized/blk_1073741832 | length: 4096 | genstamp: 1008\0A
/finalized/blk_1073741833 | length: 4096 | genstamp: 1009\0A
/finalized/blk_1073741834 | length: 4096 | genstamp: 1010\0A
/finalized/blk_1073741835 | length: 4096 | genstamp: 1011\0A
/finalized/blk_1073741836 | length: 4096 | genstamp: 1012\0A
/finalized/blk_1073741837 | length: 4096 | genstamp: 1013\0A
/finalized/blk_1073741838 | length: 4096 | genstamp: 1014\0A
/finalized/blk_1073741839 | length: 4096 | genstamp: 1015\0A
/finalized/blk_1073741840 | length: 4096 | genstamp: 1016\0A
/finalized/blk_1073741841 | length: 4096 | genstamp: 1017\0A
/finalized/blk_1073741842 | length: 4096 | genstamp: 1018\0A
/finalized/blk_1073741843 | length: 4096 | genstamp: 1019\0A
/finalized/blk_1073741844 | length: 4096 | genstamp: 1020\0A
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001\0A
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002\0A
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003\0A
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004\0A
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005\0A
/finalized/blk_1073741825 | length: 512 | genstamp: 1001\0A
/finalized/blk_1073741826 | length: 512 | genstamp: 1002\0A
/finalized/blk_1073741828 | length: 512 | genstamp: 1004\0A
/finalized/blk_1073741827 | length: 512 | genstamp: 1003\0A
/finalized/blk_1073741829 | length: 512 | genstamp: 1005\0A
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001\0A
/finalized/blk_1073741826 | length: 4096 | genstamp: 1002\0A
/finalized/blk_1073741827 | length: 4096 | genstamp: 1003\0A
/finalized/blk_1073741829 | length: 4096 | genstamp: 1005\0A
/finalized/blk_1073741831 | length: 4096 | genstamp: 1007\0A
/finalized/blk_1073741833 | length: 4096 | genstamp: 1009\0A
/finalized/blk_1073741835 | length: 4096 | genstamp: 1011\0A
/finalized/blk_1073741837 | length: 4096 | genstamp: 1013\0A
/finalized/blk_1073741839 | length: 4096 | genstamp: 1015\0A
/finalized/blk_1073741828 | length: 4096 | genstamp: 1004\0A
/finalized/blk_1073741830 | length: 4096 | genstamp: 1006\0A
/finalized/blk_1073741841 | length: 4096 | genstamp: 1017\0A
/finalized/blk_1073741832 | length: 4096 | genstamp: 1008\0A
/finalized/blk_1073741834 | length: 4096 | genstamp: 1010\0A
/finalized/blk_1073741836 | length: 4096 | genstamp: 1012\0A
/finalized/blk_1073741838 | length: 4096 | genstamp: 1014\0A
/finalized/blk_1073741840 | length: 4096 | genstamp: 1016\0A
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001\0A
/finalized/blk_1073741825 | length: 4096 | genstamp: 1001\0A
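
To summarize the two dumps at a glance: on x86_64 the traced lengths are only
ever 512 or 4096, while on PPC64 the very same block (blk_1073741825) is
reported with lengths 512, 4096, 16384 and 65536. A throwaway helper that
extracts this from lines of the format above (my own sketch, with a few PPC64
lines inlined as sample input):

import java.util.*;
import java.util.regex.*;

public class DumpSummary {
  // A few of the PPC64 lines from the dump above, verbatim.
  static final String[] PPC64_LINES = {
      "/finalized/blk_1073741825 | length: 512 | genstamp: 1001",
      "/finalized/blk_1073741825 | length: 65536 | genstamp: 1001",
      "/finalized/blk_1073741825 | length: 4096 | genstamp: 1001",
      "/finalized/blk_1073741825 | length: 16384 | genstamp: 1001",
      "/finalized/blk_1073741826 | length: 512 | genstamp: 1002",
      "/finalized/blk_1073741826 | length: 65536 | genstamp: 1002",
  };

  static final Pattern LINE = Pattern.compile("blk_(\\d+) \\| length: (\\d+)");

  public static void main(String[] args) {
    Map<String, SortedSet<Long>> lengthsPerBlock = new TreeMap<>();
    for (String line : PPC64_LINES) {
      Matcher m = LINE.matcher(line);
      if (m.find()) {
        lengthsPerBlock
            .computeIfAbsent(m.group(1), k -> new TreeSet<>())
            .add(Long.parseLong(m.group(2)));
      }
    }
    // On PPC64, a single block is reported with several different lengths.
    lengthsPerBlock.forEach((blk, lengths) ->
        System.out.println("blk_" + blk + " reported with lengths " + lengths));
  }
}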


=================================================================


Here are the main classes and methods involved in obtaining the length:

public class Block implements Writable, Comparable<Block> {
...
  public long getNumBytes() {
    return numBytes;
  }
...
}

public interface Replica {
...
}

abstract public class ReplicaInfo extends Block implements Replica {
...
}

public class FinalizedReplica extends ReplicaInfo {
  @Override
  public long getVisibleLength() {
    return getNumBytes(); // all bytes are visible
  }
}

class ReplicaMap {
...
  ReplicaInfo get(String bpid, long blockId) {
    checkBlockPool(bpid);
    synchronized(mutex) {
      Map<Long, ReplicaInfo> m = map.get(bpid);
      return m != null ? m.get(blockId) : null;
    }
  }
...
}

class FsDatasetImpl implements FsDatasetSpi<FsVolumeImpl> {

final ReplicaMap volumeMap;

private void cacheBlock(String bpid, long blockId) {
...
    long length, genstamp;
...
    ReplicaInfo info = volumeMap.get(bpid, blockId);
...
    length = info.getVisibleLength();
...
    cacheManager.cacheBlock(blockId, bpid, blockFileName, length, genstamp,
        volumeExecutor);
...
}

cacheBlock(long blockId, String bpid, String blockFileName, long length,
    long genstamp, Executor volumeExecutor)
{
....
    volumeExecutor.execute(new CachingTask(key, blockFileName, length,
        genstamp));
}

public class FsDatasetCache {
...

private class CachingTask implements Runnable {
...
    private final long length;
...

    CachingTask(ExtendedBlockId key, String blockFileName, long length,
        long genstamp) {
...
      this.length = length;
...
    }

    @Override
    public void run() {
....
      ExtendedBlock extBlk = new ExtendedBlock(key.getBlockPoolId(),
          key.getBlockId(), length, genstamp);
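
Putting the excerpts together in one runnable picture (stub classes, not the
real HDFS types; a sketch of the data flow only): the length is read once from
the replica via getVisibleLength()/getNumBytes() when FsDatasetImpl.cacheBlock()
runs, and that snapshot is what the asynchronous CachingTask later passes to
reserve(). So whatever length the replica reports at that moment is the value
that ends up in the reservation.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Condensed stand-ins for the classes quoted above (not the real HDFS types).
public class LengthFlowSketch {

  // Plays the role of Block/FinalizedReplica: the visible length is just numBytes.
  static class Replica {
    private final long numBytes;
    Replica(long numBytes) { this.numBytes = numBytes; }
    long getVisibleLength() { return numBytes; }
  }

  // Plays the role of ReplicaMap: blockId -> replica.
  static class ReplicaMap {
    private final Map<Long, Replica> map = new HashMap<>();
    void add(long blockId, Replica r) { map.put(blockId, r); }
    Replica get(long blockId) { return map.get(blockId); }
  }

  // Plays the role of FsDatasetCache.CachingTask: it keeps the length it was built with.
  static class CachingTask implements Runnable {
    private final long blockId;
    private final long length;
    CachingTask(long blockId, long length) { this.blockId = blockId; this.length = length; }
    public void run() {
      System.out.println("CachingTask for blk_" + blockId + " will reserve length " + length);
    }
  }

  public static void main(String[] args) {
    ReplicaMap volumeMap = new ReplicaMap();
    volumeMap.add(1073741825L, new Replica(512));

    // FsDatasetImpl.cacheBlock(): the length is snapshotted here, from the replica ...
    long length = volumeMap.get(1073741825L).getVisibleLength();

    // ... and handed unchanged to the asynchronous task, which uses it later for reserve().
    ExecutorService volumeExecutor = Executors.newSingleThreadExecutor();
    volumeExecutor.execute(new CachingTask(1073741825L, length));
    volumeExecutor.shutdown();   // so this sketch terminates once the task has run
  }
}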


> testPageRounder   (org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-6515
>                 URL: https://issues.apache.org/jira/browse/HDFS-6515
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.4.0
>         Environment: Linux on PPC64
>            Reporter: Tony Reix
>            Priority: Blocker
>              Labels: test
>
> I have an issue with test :
>    testPageRounder
>   (org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)
> on Linux/PowerPC.
> On Linux/Intel, test runs fine.
> On Linux/PowerPC, I have:
> testPageRounder(org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)  
> Time elapsed: 64.037 sec  <<< ERROR!
> java.lang.Exception: test timed out after 60000 milliseconds
> Looking at details, I see that some "Failed to cache " messages appear in the 
> traces. Only 10 on Intel, but 186 on PPC64.
> On PPC64, it looks like some thread is waiting for something that never 
> happens, generating a TimeOut.
> I'm now using IBM JVM, however I've just checked that the issue also appears 
> with OpenJDK.
> I'm now using Hadoop latest, however, the issue appeared within Hadoop 2.4.0 .
> I need help for understanding what the test is doing, what traces are 
> expected, in order to understand what/where is the root cause.


