[jira] [Created] (HDFS-8520) Patch for PPC64
Tony Reix created HDFS-8520:
----------------------------
         Summary: Patch for PPC64
             Key: HDFS-8520
             URL: https://issues.apache.org/jira/browse/HDFS-8520
         Project: Hadoop HDFS
      Issue Type: Bug
Affects Versions: 2.7.1
     Environment: RHEL 7.1 / PPC64
        Reporter: Tony Reix
         Fix For: 2.7.1

The attached patch enables Hadoop to work on PPC64. It deals with SystemPageSize and BlockSize, which are not 4096 on PPC64.

There are changes in 3 files:
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestFsDatasetCache.java
- hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java

where 4096 is replaced by getOperatingSystemPageSize() or by using PAGE_SIZE.

The patch has been built on branch-2.7.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
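The replacement pattern described in the report (derive page-dependent constants from the OS page size instead of hard-coding 4096) can be sketched as follows. This is an illustration, not the actual Hadoop code: osPageSize() is a hypothetical stand-in for NativeIO.POSIX.getCacheManipulator().getOperatingSystemPageSize(), backed here by a system property so the sketch compiles and runs on its own.

```java
// Sketch only: constants derived from the OS page size, so they are
// 4096-based on x86_64 and 65536-based on PPC64 without code changes.
public class PageSizeConstants {
    // Hypothetical stand-in for the Hadoop call
    // NativeIO.POSIX.getCacheManipulator().getOperatingSystemPageSize();
    // faked with a system property to keep the sketch self-contained.
    static long osPageSize() {
        return Long.getLong("test.page.size", 4096L);
    }

    static final long PAGE_SIZE = osPageSize();
    static final long BLOCK_SIZE = PAGE_SIZE;
    // Capacity expressed in pages rather than bytes, so it scales with HW.
    static final long CACHE_CAPACITY = 16 * PAGE_SIZE;

    public static void main(String[] args) {
        System.out.println(PAGE_SIZE + " " + CACHE_CAPACITY);
    }
}
```

Run with -Dtest.page.size=65536 the same constants become PPC64-sized, which is the point of replacing the literal 4096.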
[jira] [Created] (HDFS-8519) Patch for PPC64
Tony Reix created HDFS-8519:
----------------------------
         Summary: Patch for PPC64
             Key: HDFS-8519
             URL: https://issues.apache.org/jira/browse/HDFS-8519
         Project: Hadoop HDFS
      Issue Type: Bug
Affects Versions: 2.7.1
     Environment: RHEL 7.1 / PPC64
        Reporter: Tony Reix
         Fix For: 2.7.1

(Description identical to HDFS-8520.)
[jira] [Created] (HDFS-8518) Patch for PPC64
Tony Reix created HDFS-8518:
----------------------------
         Summary: Patch for PPC64
             Key: HDFS-8518
             URL: https://issues.apache.org/jira/browse/HDFS-8518
         Project: Hadoop HDFS
      Issue Type: Bug
Affects Versions: 2.7.1
     Environment: RHEL 7.1 / PPC64
        Reporter: Tony Reix
         Fix For: 2.7.1

(Description identical to HDFS-8520.)
[jira] [Resolved] (HDFS-8518) Patch for PPC64
[ https://issues.apache.org/jira/browse/HDFS-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Reix resolved HDFS-8518.
-----------------------------
    Resolution: Invalid
[jira] [Resolved] (HDFS-8519) Patch for PPC64
[ https://issues.apache.org/jira/browse/HDFS-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Reix resolved HDFS-8519.
-----------------------------
    Resolution: Invalid
[jira] [Created] (HDFS-8506) List of 33 Unstable tests on branch-2.7
Tony Reix created HDFS-8506:
----------------------------
         Summary: List of 33 Unstable tests on branch-2.7
             Key: HDFS-8506
             URL: https://issues.apache.org/jira/browse/HDFS-8506
         Project: Hadoop HDFS
      Issue Type: Bug
      Components: test
Affects Versions: 2.7.1
     Environment: Ubuntu / x86_64, Hadoop branch-2.7 (source code of Monday May 26th)
        Reporter: Tony Reix

On my Ubuntu / x86_64 machine, which has been configured for Hadoop for months, I ran the tests of branch-2.7 (source code of Monday May 26th) for days. This produced 14 runs in the exact same environment, and they show that several tests fail intermittently, at random.

12 of the runs gave the exact same number of tests executed and skipped:
- 10977 tests
- 254 skipped

One run gave only 10972 tests; another gave only 9760 tests. I used the 12 runs with 10977 tests to build the attached result file, which shows that 33 tests fail intermittently.

Legend:
  T: Tests   F: Failures   E: Errors   S: Skipped
  NN/n: number of times the issue appeared out of the 12 runs
  m-M: minimum number of failures up to maximum number of failures

Example:
                                    T  F    E    S |  NN/n
  cli.TestHDFSCLI                   1  0-1  0    0 | 11/12
  hdfs.TestAppendSnapshotTruncate   1  0    0-1  0 |  1/12
  ...
[jira] [Created] (HDFS-6608) FsDatasetCache: hard-coded 4096 value in test is not appropriate for all HW
Tony Reix created HDFS-6608:
----------------------------
         Summary: FsDatasetCache: hard-coded 4096 value in test is not appropriate for all HW
             Key: HDFS-6608
             URL: https://issues.apache.org/jira/browse/HDFS-6608
         Project: Hadoop HDFS
      Issue Type: Bug
      Components: test
Affects Versions: 3.0.0
     Environment: PPC64 (LE and BE, OpenJDK and IBM JVM, Ubuntu, RHEL 7, RHEL 6.5)
        Reporter: Tony Reix

The value 4096 is hard-coded in HDFS code (product and tests). It appears 171 times, including 8 times in product (non-test) code:
- hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs : 163
- hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs : 4
- hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/fs/http : 3
- hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/lib/wsrs : 1

This value is used for different purposes: files, block size, page size, etc. As a block size and page size, 4096 is appropriate for many systems, but not for PPC64, where it is 65536.

Looking at the HDFS product (non-test) code, it seems (though not 100% sure) that the code is OK and does not use a hard-coded page/block size; however, someone should check this in depth:

    this.maxBytes = dataset.datanode.getDnConf().getMaxLockedMemory();

At test level, however, the value 4096 is used in many places, and it is very hard to tell whether or not it depends on the HW architecture. In the test TestFsDatasetCache#testPageRounder, the HW value is sometimes obtained from the system:

    private static final long PAGE_SIZE =
        NativeIO.POSIX.getCacheManipulator().getOperatingSystemPageSize();
    private static final long BLOCK_SIZE = PAGE_SIZE;

but there are several places where 4096 is used even though the value should depend on the HW:

    conf.setLong(DFSConfigKeys.DFS_DATANODE_MAX_LOCKED_MEMORY_KEY, CACHE_CAPACITY);

with:

    // Most Linux installs allow a default of 64KB locked memory
    private static final long CACHE_CAPACITY = 64 * 1024;

For PPC64, however, this value should be much bigger. The TestFsDatasetCache#testPageRounder test aims to cache 5 pages of size 512.
However, the page size is 65536 on PPC64 and 4096 on x86_64. Thus, the method in charge of reserving blocks in the HDFS cache reserves in 4096-byte steps on x86_64 and in 65536-byte steps on PPC64, against a hard-coded limit of maxBytes = 65536 bytes:
- 5 * 4096 = 20480 : OK
- 5 * 65536 = 327680 : KO: the test ends in a timeout, since the limit is exceeded at the very beginning and the test keeps waiting.

As a conclusion, there are several issues to fix:
- instead of many hard-coded 4096 values, the (mainly test) code should use Java constants built from HW values (such as NativeIO.POSIX.getCacheManipulator().getOperatingSystemPageSize());
- several distinct constants must be used, since 4096 is used for different purposes, including some that do not depend on the HW;
- the test must be improved to handle cases where the limit is exceeded at the very beginning.
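The failure follows directly from the rounding arithmetic described above. Below is a minimal sketch of that arithmetic, reproducing the numbers quoted in the report; it is an illustration, not the actual FsDatasetCache implementation: each cached block is assumed to be rounded up to a whole number of OS pages before being counted against the 64 KB limit.

```java
// Sketch of the page-rounding arithmetic (assumed behavior, not the
// real Hadoop code): round each 512-byte block up to a whole OS page,
// then compare the total against the hard-coded 64 KB limit.
public class PageRounderMath {
    static long roundUpToPage(long bytes, long pageSize) {
        // Round bytes up to the next multiple of pageSize.
        return ((bytes + pageSize - 1) / pageSize) * pageSize;
    }

    public static void main(String[] args) {
        long maxBytes = 64 * 1024;   // hard-coded CACHE_CAPACITY limit
        long blocks = 5, blockLen = 512;

        for (long pageSize : new long[] {4096, 65536}) {
            long needed = blocks * roundUpToPage(blockLen, pageSize);
            System.out.println(pageSize + " -> " + needed
                + (needed <= maxBytes ? " : OK" : " : KO (exceeds " + maxBytes + ")"));
        }
    }
}
```

On a 4096-byte-page system the five 512-byte blocks cost 20480 bytes and fit; on PPC64 each block costs a full 65536-byte page, so the very first reservation already exhausts the limit, and the test waits until the global timeout fires.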
[jira] [Created] (HDFS-6515) testPageRounder (org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)
Tony Reix created HDFS-6515:
----------------------------
         Summary: testPageRounder (org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)
             Key: HDFS-6515
             URL: https://issues.apache.org/jira/browse/HDFS-6515
         Project: Hadoop HDFS
      Issue Type: Bug
      Components: datanode
Affects Versions: 2.4.0
     Environment: Linux on PPC64
        Reporter: Tony Reix
        Priority: Blocker

I have an issue with the test testPageRounder (org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache) on Linux/PowerPC. On Linux/Intel, the test runs fine. On Linux/PowerPC, I get:

    testPageRounder(org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache)
    Time elapsed: 64.037 sec  ERROR!
    java.lang.Exception: test timed out after 6 milliseconds

Looking at the details, I see that some "Failed to cache" messages appear in the traces: only 10 on Intel, but 186 on PPC64. On PPC64, it looks like some thread is waiting for something that never happens, generating a timeout.

I am now using the IBM JVM; however, I have just checked that the issue also appears with OpenJDK. I am now using the latest Hadoop, but the issue first appeared with Hadoop 2.4.0.

I need help understanding what the test is doing and which traces are expected, in order to find the root cause.
[jira] [Created] (HDFS-6311) TestLargeBlock#testLargeBlockSize : File /tmp/TestLargeBlock/2147484160.dat could only be replicated to 0 nodes instead of minReplication (=1)
Tony Reix created HDFS-6311:
----------------------------
         Summary: TestLargeBlock#testLargeBlockSize : File /tmp/TestLargeBlock/2147484160.dat could only be replicated to 0 nodes instead of minReplication (=1)
             Key: HDFS-6311
             URL: https://issues.apache.org/jira/browse/HDFS-6311
         Project: Hadoop HDFS
      Issue Type: Bug
      Components: test
Affects Versions: 2.4.0
     Environment: VirtualBox - Ubuntu 14.04 - x86_64
        Reporter: Tony Reix

I'm testing HDFS 2.4.0.

Apache Hadoop HDFS: Tests run: 2650, Failures: 2, Errors: 2, Skipped: 99

I get the following error each time I launch my tests (3 tries):

    Forking command line: /bin/sh -c cd /home/tony/HADOOP/hadoop-2.4.0-src/hadoop-hdfs-project/hadoop-hdfs
    /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError
    -jar /home/tony/HADOOP/hadoop-2.4.0-src/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefirebooter2355654085353142996.jar
    /home/tony/HADOOP/hadoop-2.4.0-src/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire983005167523288650tmp
    /home/tony/HADOOP/hadoop-2.4.0-src/hadoop-hdfs-project/hadoop-hdfs/target/surefire/surefire_4328161716955453811297tmp

    Running org.apache.hadoop.hdfs.TestLargeBlock
    Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 16.011 sec
    FAILURE! - in org.apache.hadoop.hdfs.TestLargeBlock
    testLargeBlockSize(org.apache.hadoop.hdfs.TestLargeBlock)  Time elapsed: 15.549 sec  ERROR!
    org.apache.hadoop.ipc.RemoteException: File /tmp/TestLargeBlock/2147484160.dat could only be
    replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and
    no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2008)
    at org.apache.hadoop.ipc.Client.call(Client.java:1410)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy16.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy16.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:361)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1439)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1261)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:525)