[jira] [Commented] (HDFS-14820) The default 8KB buffer of BlockReaderRemote#newBlockReader#BufferedOutputStream is too big
[ https://issues.apache.org/jira/browse/HDFS-14820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980612#comment-16980612 ]

Eric Yang commented on HDFS-14820:
----------------------------------

The read buffer size is usually slightly smaller than the MTU. An 8 KB buffer works well on a jumbo-frame network, which is a must-have for 10 Gb+ network cards. However, most systems default to an MTU of 1500, so a default value close to 1400 may be more sensible; otherwise the default configuration suddenly runs at roughly one third of what the network is capable of by default.

> The default 8KB buffer of
> BlockReaderRemote#newBlockReader#BufferedOutputStream is too big
> ----------------------------------------------------------------
>
> Key: HDFS-14820
> URL: https://issues.apache.org/jira/browse/HDFS-14820
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Lisheng Sun
> Assignee: Lisheng Sun
> Priority: Major
> Attachments: HDFS-14820.001.patch, HDFS-14820.002.patch, HDFS-14820.003.patch
>
> This issue is similar to HDFS-14535.
> {code:java}
> public static BlockReader newBlockReader(String file,
>     ExtendedBlock block,
>     Token blockToken,
>     long startOffset, long len,
>     boolean verifyChecksum,
>     String clientName,
>     Peer peer, DatanodeID datanodeID,
>     PeerCache peerCache,
>     CachingStrategy cachingStrategy,
>     int networkDistance) throws IOException {
>   // in and out will be closed when sock is closed (by the caller)
>   final DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
>       peer.getOutputStream()));
>   new Sender(out).readBlock(block, blockToken, clientName, startOffset, len,
>       verifyChecksum, cachingStrategy);
> }
>
> public BufferedOutputStream(OutputStream out) {
>   this(out, 8192);
> }
> {code}
> The Sender#readBlock parameters (block, blockToken, clientName, startOffset, len,
> verifyChecksum, cachingStrategy) could not use such a big buffer,
> so I think we should reduce the BufferedOutputStream buffer size.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
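For illustration, here is a minimal, self-contained sketch (not Hadoop code; the class name, and the 1400-byte size taken from the suggestion above, are assumptions for the example) showing how the two-argument BufferedOutputStream constructor controls the buffer size instead of accepting the 8192-byte default:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BufferSizeSketch {
    // Hypothetical constant: a buffer just under the common 1500-byte
    // Ethernet MTU, leaving room for IP/TCP headers.
    static final int SMALL_IO_BUFFER = 1400;

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // The two-argument constructor lets the caller pick the buffer size.
        DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(sink, SMALL_IO_BUFFER));
        out.writeBytes("readBlock request header");
        // Nothing reaches the sink until the buffer fills or is flushed.
        System.out.println("before flush: " + sink.size()); // prints 0
        out.flush();
        System.out.println("after flush: " + sink.size());  // prints 24
    }
}
```

Since a readBlock request is only a small protobuf header, it never comes close to filling either buffer; the size choice mainly matters for how much heap each open stream pins.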
[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter
[ https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated HDFS-14730:
-----------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

I just committed this to trunk. Thank you [~zhangchen].

> Remove unused configuration dfs.web.authentication.filter
> ---------------------------------------------------------
>
> Key: HDFS-14730
> URL: https://issues.apache.org/jira/browse/HDFS-14730
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Chen Zhang
> Assignee: Chen Zhang
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14730.001.patch, HDFS-14730.002.patch
>
> After HADOOP-16314, this configuration is not used anywhere, so I propose to
> deprecate it to avoid misuse.
[jira] [Commented] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter
[ https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961343#comment-16961343 ]

Eric Yang commented on HDFS-14730:
----------------------------------

+1 for patch 002. Will commit to trunk if there are no objections. [~zhangchen] Thank you for the patch.

> Remove unused configuration dfs.web.authentication.filter
> ---------------------------------------------------------
>
> Key: HDFS-14730
> URL: https://issues.apache.org/jira/browse/HDFS-14730
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Chen Zhang
> Assignee: Chen Zhang
> Priority: Major
> Attachments: HDFS-14730.001.patch, HDFS-14730.002.patch
>
> After HADOOP-16314, this configuration is not used anywhere, so I propose to
> deprecate it to avoid misuse.
[jira] [Commented] (HDDS-1701) Move dockerbin script to libexec
[ https://issues.apache.org/jira/browse/HDDS-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957965#comment-16957965 ]

Eric Yang commented on HDDS-1701:
---------------------------------

[~cxorm] We cannot call it bin/docker due to the prohibited trademark use described in the [Docker Inc trademark guidelines|https://www.docker.com/legal/trademark-guidelines]. The scripts are used for docker image startup and for configuration conversion during bootstrap. They are referenced by the Dockerfile. There is no need to make another script that calls the scripts in libexec.

> Move dockerbin script to libexec
> --------------------------------
>
> Key: HDDS-1701
> URL: https://issues.apache.org/jira/browse/HDDS-1701
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Eric Yang
> Assignee: YiSheng Lien
> Priority: Major
>
> The Ozone tarball structure contains a new bin script directory called dockerbin.
> These utility scripts can be relocated to OZONE_HOME/libexec because they are
> internal binaries that are not intended to be executed directly by users or
> shell scripts.
[jira] [Commented] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent
[ https://issues.apache.org/jira/browse/HDDS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956312#comment-16956312 ]

Eric Yang commented on HDDS-1847:
---------------------------------

[~chris.t...@gmail.com] Hadoop 3.3.0+ has changed back to using hadoop.http.authentication.kerberos.keytab for securing the HTTP protocol with Kerberos. Hadoop unified the SPNEGO settings to make sure that all HTTP ports are secured by one global setting. Ozone is departing from Hadoop, hence some changes may not apply while other changes are worth considering. There are three usability improvements that might make Ozone Kerberos configuration easier to use. This ticket focuses on three problems in the Ozone Kerberos config names:

1. Datanode keytab files and principal names are inconsistent. SPNEGO configs are prefixed with hdds, but Ozone datanodes still use the dfs prefix. It may be useful to separate the Ozone datanode config from HDFS to prevent confusion.
2. The datanode SPNEGO keytab config is suffixed with keytab (which looks like the Hadoop convention), while the other Ozone processes are suffixed with keytab.file.
3. Should all SPNEGO keytab files use the same prefix, as in Hadoop, to prevent programming errors?

> Datanode Kerberos principal and keytab config key looks inconsistent
> --------------------------------------------------------------------
>
> Key: HDDS-1847
> URL: https://issues.apache.org/jira/browse/HDDS-1847
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Eric Yang
> Assignee: Chris Teoh
> Priority: Major
> Labels: newbie
>
> Ozone Kerberos configuration can be very confusing:
> | config name | Description |
> | hdds.scm.kerberos.principal | SCM service principal |
> | hdds.scm.kerberos.keytab.file | SCM service keytab file |
> | ozone.om.kerberos.principal | Ozone Manager service principal |
> | ozone.om.kerberos.keytab.file | Ozone Manager keytab file |
> | hdds.scm.http.kerberos.principal | SCM service spnego principal |
> | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file |
> | ozone.om.http.kerberos.principal | Ozone Manager spnego principal |
> | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file |
> | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file |
> | hdds.datanode.http.kerberos.principal | Datanode spnego principal |
> | dfs.datanode.kerberos.principal | Datanode service principal |
> | dfs.datanode.keytab.file | Datanode service keytab file |
>
> The prefixes are very different for each of the datanode configurations. It
> would be nice to have some consistency for the datanode.
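As an illustration of point 1 above only (these property names are hypothetical, not existing Ozone configs), a unified hdds.datanode prefix for both the service and SPNEGO credentials might look like:

```xml
<!-- Hypothetical, consistent naming: hdds.datanode.* for both credential sets -->
<property>
  <name>hdds.datanode.kerberos.principal</name>
  <value>dn/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hdds.datanode.kerberos.keytab.file</name>
  <value>/etc/security/keytabs/dn.service.keytab</value>
</property>
<property>
  <name>hdds.datanode.http.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hdds.datanode.http.kerberos.keytab.file</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
```

This would make the datanode entries line up with the hdds.scm.* and ozone.om.* rows in the table above, with keytab.file as the uniform suffix.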
[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944755#comment-16944755 ]

Eric Yang commented on HDFS-2470:
---------------------------------

[~weichiu] You might need to backport HDFS-14890 if you intend to apply this patch to branch-3.1.

> NN should automatically set permissions on dfs.namenode.*.dir
> -------------------------------------------------------------
>
> Key: HDFS-2470
> URL: https://issues.apache.org/jira/browse/HDFS-2470
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.0.0-alpha
> Reporter: Aaron Myers
> Assignee: Siddharth Wagle
> Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, HDFS-2470.03.patch,
> HDFS-2470.04.patch, HDFS-2470.05.patch, HDFS-2470.06.patch, HDFS-2470.07.patch,
> HDFS-2470.08.patch, HDFS-2470.09.patch, HDFS-2470.branch-3.1.patch
>
> Much as the DN currently sets the correct permissions for
> dfs.datanode.data.dir, the NN should do the same for
> dfs.namenode.(name|edit).dir.
[jira] [Commented] (HDFS-14890) Setting permissions on name directory fails on non posix compliant filesystems
[ https://issues.apache.org/jira/browse/HDFS-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944688#comment-16944688 ]

Eric Yang commented on HDFS-14890:
----------------------------------

Thank you [~swagle] for the patch. Thank you [~elgoiri] and [~hirik] for the reviews. I just committed this to trunk and branch-3.2.

> Setting permissions on name directory fails on non posix compliant filesystems
> ------------------------------------------------------------------------------
>
> Key: HDFS-14890
> URL: https://issues.apache.org/jira/browse/HDFS-14890
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.2.1
> Environment: Windows 10
> Reporter: hirik
> Assignee: Siddharth Wagle
> Priority: Blocker
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14890.01.patch
>
> Hi,
> HDFS NameNode and JournalNode are not starting on a Windows machine. Found
> the below related exception in the logs:
>
> Caused by: java.lang.UnsupportedOperationException
> at java.base/java.nio.file.Files.setPosixFilePermissions(Files.java:2155)
> at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:422)
> at com.slog.dfs.hdfs.nn.NameNodeServiceImpl.delayedStart(NameNodeServiceImpl.java:147)
>
> Code changes related to this issue:
> [https://github.com/apache/hadoop/commit/07e3cf952eac9e47e7bd5e195b0f9fc28c468313#diff-1a56e69d50f21b059637cfcbf1d23f11]
[jira] [Updated] (HDFS-14890) Setting permissions on name directory fails on non posix compliant filesystems
[ https://issues.apache.org/jira/browse/HDFS-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated HDFS-14890:
-----------------------------
    Fix Version/s: 3.2.2
                   3.3.0
     Hadoop Flags: Reviewed
     Release Note: Fixed namenode/journal startup on Windows.
 Target Version/s: 3.3.0, 3.2.2
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> Setting permissions on name directory fails on non posix compliant filesystems
> ------------------------------------------------------------------------------
>
> Key: HDFS-14890
> URL: https://issues.apache.org/jira/browse/HDFS-14890
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.2.1
> Environment: Windows 10
> Reporter: hirik
> Assignee: Siddharth Wagle
> Priority: Blocker
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14890.01.patch
>
> Hi,
> HDFS NameNode and JournalNode are not starting on a Windows machine. Found
> the below related exception in the logs:
>
> Caused by: java.lang.UnsupportedOperationException
> at java.base/java.nio.file.Files.setPosixFilePermissions(Files.java:2155)
> at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:422)
> at com.slog.dfs.hdfs.nn.NameNodeServiceImpl.delayedStart(NameNodeServiceImpl.java:147)
>
> Code changes related to this issue:
> [https://github.com/apache/hadoop/commit/07e3cf952eac9e47e7bd5e195b0f9fc28c468313#diff-1a56e69d50f21b059637cfcbf1d23f11]
[jira] [Commented] (HDFS-14890) HDFS NameNode and JournalNode are not starting in Windows
[ https://issues.apache.org/jira/browse/HDFS-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944679#comment-16944679 ]

Eric Yang commented on HDFS-14890:
----------------------------------

Updated the title to reflect the original issue. +1 for patch 01, which addresses the regression from HDFS-2470.

> HDFS NameNode and JournalNode are not starting in Windows
> ---------------------------------------------------------
>
> Key: HDFS-14890
> URL: https://issues.apache.org/jira/browse/HDFS-14890
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.2.1
> Environment: Windows 10
> Reporter: hirik
> Assignee: Siddharth Wagle
> Priority: Blocker
> Attachments: HDFS-14890.01.patch
>
> Hi,
> HDFS NameNode and JournalNode are not starting on a Windows machine. Found
> the below related exception in the logs:
>
> Caused by: java.lang.UnsupportedOperationException
> at java.base/java.nio.file.Files.setPosixFilePermissions(Files.java:2155)
> at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:422)
> at com.slog.dfs.hdfs.nn.NameNodeServiceImpl.delayedStart(NameNodeServiceImpl.java:147)
>
> Code changes related to this issue:
> [https://github.com/apache/hadoop/commit/07e3cf952eac9e47e7bd5e195b0f9fc28c468313#diff-1a56e69d50f21b059637cfcbf1d23f11]
[jira] [Updated] (HDFS-14890) HDFS NameNode and JournalNode are not starting in Windows
[ https://issues.apache.org/jira/browse/HDFS-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated HDFS-14890:
-----------------------------
    Summary: HDFS NameNode and JournalNode are not starting in Windows  (was: HDFS is not starting in Windows)

> HDFS NameNode and JournalNode are not starting in Windows
> ---------------------------------------------------------
>
> Key: HDFS-14890
> URL: https://issues.apache.org/jira/browse/HDFS-14890
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.2.1
> Environment: Windows 10
> Reporter: hirik
> Assignee: Siddharth Wagle
> Priority: Blocker
> Attachments: HDFS-14890.01.patch
>
> Hi,
> HDFS NameNode and JournalNode are not starting on a Windows machine. Found
> the below related exception in the logs:
>
> Caused by: java.lang.UnsupportedOperationException
> at java.base/java.nio.file.Files.setPosixFilePermissions(Files.java:2155)
> at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:422)
> at com.slog.dfs.hdfs.nn.NameNodeServiceImpl.delayedStart(NameNodeServiceImpl.java:147)
>
> Code changes related to this issue:
> [https://github.com/apache/hadoop/commit/07e3cf952eac9e47e7bd5e195b0f9fc28c468313#diff-1a56e69d50f21b059637cfcbf1d23f11]
[jira] [Commented] (HDFS-14845) Request is a replay (34) error in httpfs
[ https://issues.apache.org/jira/browse/HDFS-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936213#comment-16936213 ]

Eric Yang commented on HDFS-14845:
----------------------------------

[~Prabhu Joseph] Thank you for the patch. Patch 004 looks good to me. [~aajisaka] let us know if this looks good on your end. Thanks.

> Request is a replay (34) error in httpfs
> ----------------------------------------
>
> Key: HDFS-14845
> URL: https://issues.apache.org/jira/browse/HDFS-14845
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: httpfs
> Affects Versions: 3.3.0
> Environment: Kerberos and ZKDelegationTokenSecretManager enabled in HttpFS
> Reporter: Akira Ajisaka
> Assignee: Prabhu Joseph
> Priority: Critical
> Attachments: HDFS-14845-001.patch, HDFS-14845-002.patch, HDFS-14845-003.patch, HDFS-14845-004.patch
>
> We are facing a "Request is a replay (34)" error when accessing HDFS via
> httpfs on trunk.
> {noformat}
> % curl -i --negotiate -u : "https://:4443/webhdfs/v1/?op=liststatus"
> HTTP/1.1 401 Authentication required
> Date: Mon, 09 Sep 2019 06:00:04 GMT
> Date: Mon, 09 Sep 2019 06:00:04 GMT
> Pragma: no-cache
> X-Content-Type-Options: nosniff
> X-XSS-Protection: 1; mode=block
> WWW-Authenticate: Negotiate
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 271
>
> HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
> Date: Mon, 09 Sep 2019 06:00:04 GMT
> Date: Mon, 09 Sep 2019 06:00:04 GMT
> Pragma: no-cache
> X-Content-Type-Options: nosniff
> X-XSS-Protection: 1; mode=block
> (snip)
> Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly
> Cache-Control: must-revalidate,no-cache,no-store
> Content-Type: text/html;charset=iso-8859-1
> Content-Length: 413
>
> Error 403 GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
>
> HTTP ERROR 403
> Problem accessing /webhdfs/v1/.
> Reason:
>     GSSException: Failure unspecified at GSS-API level (Mechanism level: Request is a replay (34))
> {noformat}
[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test
[ https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936045#comment-16936045 ]

Eric Yang commented on HDFS-14461:
----------------------------------

[~hexiaoqiao] Thank you for the patch. Patch 005 looks good to me. +1

> RBF: Fix intermittently failing kerberos related unit test
> ----------------------------------------------------------
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: CR Hota
> Assignee: He Xiaoqiao
> Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, HDFS-14461.003.patch, HDFS-14461.004.patch, HDFS-14461.005.patch
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It
> may be due to some race condition before using the keytab that's created for
> testing.
>
> {code:java}
> Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
> Failing for the past 1 build (Since
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png!
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
> [Took 89 ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>
> Error Message
>
> org.apache.hadoop.security.KerberosAuthException: failure to login: for
> principal: router/localh...@example.com from keytab
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
> javax.security.auth.login.LoginException: Integrity check on decrypted field
> failed (31) - PREAUTH_FAILED
>
> h3. Stacktrace
>
> org.apache.hadoop.service.ServiceStateException:
> org.apache.hadoop.security.KerberosAuthException: failure to login: for
> principal: router/localh...@example.com from keytab
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
> javax.security.auth.login.LoginException: Integrity check on decrypted field
> failed (31) - PREAUTH_FAILED
> at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
> at org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to
> login: for principal: router/localh...@example.com from keytab
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
[jira] [Comment Edited] (HDFS-14845) Request is a replay (34) error in httpfs
[ https://issues.apache.org/jira/browse/HDFS-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934589#comment-16934589 ]

Eric Yang edited comment on HDFS-14845 at 9/20/19 5:17 PM:
-----------------------------------------------------------

[~Prabhu Joseph] Thank you for the patch. I tested with these sets of configuration, and both can work as long as I define hadoop.http.authentication.signature.secret.file.

{code}
hadoop.http.authentication.type kerberos
hadoop.http.authentication.kerberos.principal HTTP/host1.example@example.com
hadoop.http.authentication.kerberos.keytab /etc/security/keytabs/spnego.service.keytab
hadoop.http.authentication.signature.secret.file ${httpfs.config.dir}/httpfs-signature.secret
hadoop.http.filter.initializers org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilterInitializer,org.apache.hadoop.security.HttpCrossOriginFilterInitializer
hadoop.authentication.type kerberos
httpfs.hadoop.authentication.type kerberos
httpfs.hadoop.authentication.kerberos.principal nn/host1.example@example.com
httpfs.hadoop.authentication.kerberos.keytab /etc/security/keytabs/hdfs.service.keytab
{code}

The backward-compatible config also works:

{code}
hadoop.http.authentication.type kerberos
httpfs.authentication.signature.secret.file ${httpfs.config.dir}/httpfs-signature.secret
hadoop.http.filter.initializers org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilterInitializer,org.apache.hadoop.security.HttpCrossOriginFilterInitializer
httpfs.authentication.type kerberos
httpfs.hadoop.authentication.type kerberos
httpfs.authentication.kerberos.principal HTTP/host-1.example@example.com
httpfs.authentication.kerberos.keytab /etc/security/keytabs/spnego.service.keytab
httpfs.hadoop.authentication.kerberos.principal nn/host-1.example@example.com
httpfs.hadoop.authentication.kerberos.keytab /etc/security/keytabs/hdfs.service.keytab
{code}

When httpfs.authentication.signature.secret.file is undefined in httpfs-site.xml, the httpfs server doesn't work:

{code}
Exception in thread "main" java.io.IOException: Unable to initialize WebAppContext
at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:1198)
at org.apache.hadoop.fs.http.server.HttpFSServerWebServer.start(HttpFSServerWebServer.java:154)
at org.apache.hadoop.fs.http.server.HttpFSServerWebServer.main(HttpFSServerWebServer.java:187)
Caused by: java.lang.RuntimeException: Undefined property: signature.secret.file
at org.apache.hadoop.fs.http.server.HttpFSAuthenticationFilter.getConfiguration(HttpFSAuthenticationFilter.java:95)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.init(AuthenticationFilter.java:160)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.init(DelegationTokenAuthenticationFilter.java:180)
at org.eclipse.jetty.servlet.FilterHolder.initialize(FilterHolder.java:139)
at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:881)
at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:349)
at org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1406)
at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1368)
at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:778)
at org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:262)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:522)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:113)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.server.Server.start(Server.java:427)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61)
at org.eclipse.jetty.server.Server.doStart(Server.java:394)
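Based on the "Undefined property: signature.secret.file" failure above, the filter appears to require the secret file to be defined. As a hedged sketch only (the value path mirrors the one used in the working configurations above; adjust for your deployment), the single property that made both configurations work would look like this in XML form:

```xml
<!-- Minimal httpfs-site.xml entry avoiding the startup failure above -->
<property>
  <name>hadoop.http.authentication.signature.secret.file</name>
  <value>${httpfs.config.dir}/httpfs-signature.secret</value>
</property>
```

The legacy httpfs.authentication.signature.secret.file name shown in the backward-compatible configuration serves the same purpose for older setups.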
[jira] [Commented] (HDFS-14845) Request is a replay (34) error in httpfs
[ https://issues.apache.org/jira/browse/HDFS-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934589#comment-16934589 ] Eric Yang commented on HDFS-14845: -- [~Prabhu Joseph] Thank you for the patch. I tested with these sets of configuration and both can work as long as I define hadoop.http.authentication.signature.secret.file. {code} hadoop.http.authentication.type kerberos hadoop.http.authentication.kerberos.principal HTTP/host1.example@example.com hadoop.http.authentication.kerberos.keytab /etc/security/keytabs/spnego.service.keytab hadoop.http.authentication.signature.secret.file ${httpfs.config.dir}/httpfs-signature.secret hadoop.http.filter.initializers org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilterInitializer,org.apache.hadoop.security.HttpCrossOriginFilterInitializer hadoop.authentication.type kerberos httpfs.hadoop.authentication.type kerberos httpfs.hadoop.authentication.kerberos.principal nn/host1.example@example.com httpfs.hadoop.authentication.kerberos.keytab /etc/security/keytabs/hdfs.service.keytab {code} Backward compatible config also works: {code} hadoop.http.authentication.type kerberos httpfs.authentication.signature.secret.file ${httpfs.config.dir}/httpfs-signature.secret hadoop.http.filter.initializers org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilterInitializer,org.apache.hadoop.security.HttpCrossOriginFilterInitializer httpfs.authentication.type kerberos httpfs.hadoop.authentication.type kerberos httpfs.authentication.kerberos.principal HTTP/host-1.example@example.com httpfs.authentication.kerberos.keytab /etc/security/keytabs/spnego.service.keytab httpfs.hadoop.authentication.kerberos.principal nn/host-1.example@example.com httpfs.hadoop.authentication.kerberos.keytab /etc/security/keytabs/hdfs.service.keytab {code} When httpfs.authentication.signature.secret.file is undefined in httpfs-site.xml, httpfs server doesn't work. 
{code} Exception in thread "main" java.io.IOException: Unable to initialize WebAppContext at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:1198) at org.apache.hadoop.fs.http.server.HttpFSServerWebServer.start(HttpFSServerWebServer.java:154) at org.apache.hadoop.fs.http.server.HttpFSServerWebServer.main(HttpFSServerWebServer.java:187) Caused by: java.lang.RuntimeException: Undefined property: signature.secret.file at org.apache.hadoop.fs.http.server.HttpFSAuthenticationFilter.getConfiguration(HttpFSAuthenticationFilter.java:95) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.init(AuthenticationFilter.java:160) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.init(DelegationTokenAuthenticationFilter.java:180) at org.eclipse.jetty.servlet.FilterHolder.initialize(FilterHolder.java:139) at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:881) at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:349) at org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1406) at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1368) at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:778) at org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:262) at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:522) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131) at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:113) at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131) at 
org.eclipse.jetty.server.Server.start(Server.java:427) at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105) at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:61) at org.eclipse.jetty.server.Server.doStart(Server.java:394) at
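The property-resolution behavior being tested above — httpfs.authentication.* overriding hadoop.http.authentication.*, with startup failing fast when signature.secret.file resolves to nothing — can be sketched with plain java.util.Properties. This is an illustrative helper, not the actual HttpFSAuthenticationFilter.getConfiguration code; class and key handling are assumptions for the sketch.

```java
import java.util.Properties;

// Hypothetical helper illustrating the fallback order discussed above:
// httpfs.authentication.* wins when present, hadoop.http.authentication.*
// fills the gap otherwise, and startup fails fast when neither defines
// signature.secret.file (matching the "Undefined property" error seen above).
public class AuthConfigFallback {
    static final String HTTPFS_PREFIX = "httpfs.authentication.";
    static final String HADOOP_PREFIX = "hadoop.http.authentication.";

    public static String resolve(Properties conf, String key) {
        String value = conf.getProperty(HTTPFS_PREFIX + key);
        if (value == null) {
            value = conf.getProperty(HADOOP_PREFIX + key);
        }
        if (value == null && "signature.secret.file".equals(key)) {
            throw new RuntimeException("Undefined property: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(HADOOP_PREFIX + "type", "kerberos");
        conf.setProperty(HTTPFS_PREFIX + "type", "simple");
        conf.setProperty(HADOOP_PREFIX + "signature.secret.file",
            "/etc/hadoop/httpfs-signature.secret");
        // httpfs.* overrides; hadoop.http.* is the fallback.
        System.out.println(resolve(conf, "type"));                  // simple
        System.out.println(resolve(conf, "signature.secret.file")); // /etc/hadoop/httpfs-signature.secret
    }
}
```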
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931942#comment-16931942 ] Eric Yang commented on HDFS-14609: -- It would be great to have a follow-up for HDFS-14461. I have reservations about giving a +1 because I don't have full visibility into whether the two patches will do the right thing together. Tentatively +1. > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, > HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch, > HDFS-14609.006.patch > > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14845) Request is a replay (34) error in httpfs
[ https://issues.apache.org/jira/browse/HDFS-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931795#comment-16931795 ] Eric Yang commented on HDFS-14845: -- [~Prabhu Joseph] Thank you for patch 002. {quote}But most of the testcases related to HttpFSServerWebServer (eg: TestHttpFSServer) requires more changes as they did not use HttpServer2 and so the filter initializers are not called, instead it uses a Test Jetty Server with HttpFSServerWebApp which are failing as the filter won't have any configs. Please let me know if we can handle this in a separate improvement Jira.{quote} All HttpFS unit tests are passing on my system. Which test requires a separate ticket? {quote}Have changed the HttpFSAuthenticationFilter$getConfiguration to honor the hadoop.http.authentication configs which will be overridden by httpfs.authentication configs.{quote} Patch 2 works for these configuration: {code} hadoop.http.authentication.type kerberos hadoop.http.authentication.kerberos.principal HTTP/host-1.example@example.com hadoop.http.authentication.kerberos.keytab /etc/security/keytabs/spnego.service.keytab hadoop.http.filter.initializers org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilterInitializer,org.apache.hadoop.security.HttpCrossOriginFilterInitializer httpfs.authentication.type kerberos hadoop.authentication.type kerberos httpfs.hadoop.authentication.type kerberos httpfs.authentication.kerberos.principal HTTP/host-1.example@example.com httpfs.authentication.kerberos.keytab /etc/security/keytabs/spnego.service.keytab httpfs.hadoop.authentication.kerberos.principal nn/host-1.example@example.com httpfs.hadoop.authentication.kerberos.keytab /etc/security/keytabs/hdfs.service.keytab {code} It doesn't work when configuration skips httpfs.hadoop.authentication.type, httpfs.authentication.kerberos.keytab and httpfs.hadoop.authentication.kerberos.principal. httpfs server doesn't start when these config are missing. 
I think some logic to map the configuration are missing in patch 002. > Request is a replay (34) error in httpfs > > > Key: HDFS-14845 > URL: https://issues.apache.org/jira/browse/HDFS-14845 > Project: Hadoop HDFS > Issue Type: Bug > Components: httpfs >Affects Versions: 3.3.0 > Environment: Kerberos and ZKDelgationTokenSecretManager enabled in > HttpFS >Reporter: Akira Ajisaka >Assignee: Prabhu Joseph >Priority: Critical > Attachments: HDFS-14845-001.patch, HDFS-14845-002.patch > > > We are facing "Request is a replay (34)" error when accessing to HDFS via > httpfs on trunk. > {noformat} > % curl -i --negotiate -u : "https://:4443/webhdfs/v1/?op=liststatus" > HTTP/1.1 401 Authentication required > Date: Mon, 09 Sep 2019 06:00:04 GMT > Date: Mon, 09 Sep 2019 06:00:04 GMT > Pragma: no-cache > X-Content-Type-Options: nosniff > X-XSS-Protection: 1; mode=block > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 271 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)) > Date: Mon, 09 Sep 2019 06:00:04 GMT > Date: Mon, 09 Sep 2019 06:00:04 GMT > Pragma: no-cache > X-Content-Type-Options: nosniff > X-XSS-Protection: 1; mode=block > (snip) > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 413 > > > > Error 403 GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > > HTTP ERROR 403 > Problem accessing /webhdfs/v1/. 
Reason: > GSSException: Failure unspecified at GSS-API level (Mechanism level: > Request is a replay (34)) > > > {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14845) Request is a replay (34) error in httpfs
[ https://issues.apache.org/jira/browse/HDFS-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929793#comment-16929793 ] Eric Yang commented on HDFS-14845: -- [~Prabhu Joseph] Would it be possible for HttpFSAuthenticationFilter to be only a parameter-passing filter that triggers filter initialization (like ProxyUserAuthenticationFilterInitializer) and then internally routes all doGet and doPost methods to the initialized filter?
1. If httpfs.authentication.* are not defined, then fall back to the default behavior to be consistent with hadoop.http.authentication.type.
2. This gives the appearance that if httpfs.authentication.type is configured to use a custom filter, the system will respond consistently with the rest of the Hadoop web endpoints.
3. If httpfs.authentication.type=kerberos, HttpFSAuthenticationFilter is a combo of Kerberos + DelegationToken + Proxy support.
> Request is a replay (34) error in httpfs > > > Key: HDFS-14845 > URL: https://issues.apache.org/jira/browse/HDFS-14845 > Project: Hadoop HDFS > Issue Type: Bug > Components: httpfs >Affects Versions: 3.3.0 > Environment: Kerberos and ZKDelgationTokenSecretManager enabled in > HttpFS >Reporter: Akira Ajisaka >Assignee: Prabhu Joseph >Priority: Critical > Attachments: HDFS-14845-001.patch > > > We are facing "Request is a replay (34)" error when accessing to HDFS via > httpfs on trunk. 
> {noformat} > % curl -i --negotiate -u : "https://:4443/webhdfs/v1/?op=liststatus" > HTTP/1.1 401 Authentication required > Date: Mon, 09 Sep 2019 06:00:04 GMT > Date: Mon, 09 Sep 2019 06:00:04 GMT > Pragma: no-cache > X-Content-Type-Options: nosniff > X-XSS-Protection: 1; mode=block > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 271 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)) > Date: Mon, 09 Sep 2019 06:00:04 GMT > Date: Mon, 09 Sep 2019 06:00:04 GMT > Pragma: no-cache > X-Content-Type-Options: nosniff > X-XSS-Protection: 1; mode=block > (snip) > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 413 > > > > Error 403 GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > > HTTP ERROR 403 > Problem accessing /webhdfs/v1/. Reason: > GSSException: Failure unspecified at GSS-API level (Mechanism level: > Request is a replay (34)) > > > {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
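The parameter-passing filter proposed above is essentially a decorator: it selects the configured authentication filter at init time and forwards every request to it. A minimal self-contained sketch of that routing follows; all types here are stand-ins (the real code would implement javax.servlet.Filter and delegate to Hadoop's AuthenticationFilter subclasses), so names and the auth-type strings are illustrative only.

```java
// Minimal stand-in for javax.servlet.Filter, so the sketch is self-contained.
interface Filter { String doFilter(String request); }

class KerberosFilter implements Filter {
    public String doFilter(String request) { return "kerberos:" + request; }
}

class CustomJwtFilter implements Filter {
    public String doFilter(String request) { return "jwt:" + request; }
}

// The routing filter: pick the configured delegate once at init time,
// then forward every request to it.
public class RoutingAuthFilter implements Filter {
    private final Filter delegate;

    RoutingAuthFilter(String authType) {
        // Fall back to the Kerberos default when httpfs.authentication.type
        // is unset or unrecognized, mirroring point 1 above.
        if ("jwt".equals(authType)) {
            delegate = new CustomJwtFilter();
        } else {
            delegate = new KerberosFilter();
        }
    }

    public String doFilter(String request) { return delegate.doFilter(request); }

    public static void main(String[] args) {
        System.out.println(new RoutingAuthFilter("jwt").doFilter("GET /webhdfs/v1/"));      // jwt:GET /webhdfs/v1/
        System.out.println(new RoutingAuthFilter("kerberos").doFilter("GET /webhdfs/v1/")); // kerberos:GET /webhdfs/v1/
    }
}
```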
[jira] [Commented] (HDFS-14845) Request is a replay (34) error in httpfs
[ https://issues.apache.org/jira/browse/HDFS-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929312#comment-16929312 ] Eric Yang commented on HDFS-14845: -- [~Prabhu Joseph] Thank you for the patch. Correct me if I am mistaken: this patch will restore HttpFSAuthenticationFilter instead of enforcing custom filters. When a user selects JWTAuthenticationFilter, it would not apply to the HttpFS server. This may not meet user expectations. The more proper solution is to nullify HttpFSAuthenticationFilter and map the authFilter initialization to the standard filter initializer only. [~aajisaka] Let us know if you really intend to use JWTAuthenticationFilter with HttpFS, or if you only want HttpFSAuthenticationFilter? Thanks > Request is a replay (34) error in httpfs > > > Key: HDFS-14845 > URL: https://issues.apache.org/jira/browse/HDFS-14845 > Project: Hadoop HDFS > Issue Type: Bug > Components: httpfs >Affects Versions: 3.3.0 > Environment: Kerberos and ZKDelgationTokenSecretManager enabled in > HttpFS >Reporter: Akira Ajisaka >Assignee: Prabhu Joseph >Priority: Critical > Attachments: HDFS-14845-001.patch > > > We are facing "Request is a replay (34)" error when accessing to HDFS via > httpfs on trunk. 
> {noformat} > % curl -i --negotiate -u : "https://:4443/webhdfs/v1/?op=liststatus" > HTTP/1.1 401 Authentication required > Date: Mon, 09 Sep 2019 06:00:04 GMT > Date: Mon, 09 Sep 2019 06:00:04 GMT > Pragma: no-cache > X-Content-Type-Options: nosniff > X-XSS-Protection: 1; mode=block > WWW-Authenticate: Negotiate > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 271 > HTTP/1.1 403 GSSException: Failure unspecified at GSS-API level (Mechanism > level: Request is a replay (34)) > Date: Mon, 09 Sep 2019 06:00:04 GMT > Date: Mon, 09 Sep 2019 06:00:04 GMT > Pragma: no-cache > X-Content-Type-Options: nosniff > X-XSS-Protection: 1; mode=block > (snip) > Set-Cookie: hadoop.auth=; Path=/; Secure; HttpOnly > Cache-Control: must-revalidate,no-cache,no-store > Content-Type: text/html;charset=iso-8859-1 > Content-Length: 413 > > > > Error 403 GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > > HTTP ERROR 403 > Problem accessing /webhdfs/v1/. Reason: > GSSException: Failure unspecified at GSS-API level (Mechanism level: > Request is a replay (34)) > > > {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927797#comment-16927797 ] Eric Yang commented on HDFS-14609: -- [~zhangchen] Thank you for the patch. Once the checkstyle issue is fixed, the rest looks good to me. > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, > HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch > > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-1554) Create disk tests for fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921541#comment-16921541 ] Eric Yang commented on HDDS-1554: - [~arp] Closer examination shows that: {code} mvn -T 1C clean install -DskipTests=true -Pdist -Dtar -DskipShade -Pit,docker-build -Ddocker.image=apache/ozone:0.5.0-SNAPSHOT {code} This does not work because the skipTests flag is set. {code} mvn test -Pit -Ddocker.image=apache/ozone:0.5.0-SNAPSHOT {code} This also doesn't work because the tests are written for the integration-test phase; running only the test phase does not trigger the integration tests. The proper command to run looks like either of the following examples: {code} mvn clean install -Pit,docker-build mvn verify -Pit -Ddocker.image=apache/ozone:0.5.0-SNAPSHOT {code} Hope this clarifies the usage of the maven commands for these integration tests. If the commands are too cumbersome, we can remove the "it" profile. I prefer to avoid the docker-build and docker.image parameters. They are mandatory today because the dist module supports three ways of using docker images, hence it is necessary to specify from the top level which image to use. 
> Create disk tests for fault injection test > -- > > Key: HDDS-1554 > URL: https://issues.apache.org/jira/browse/HDDS-1554 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: build >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, > HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, > HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, > HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch, > HDDS-1554.012.patch, HDDS-1554.013.patch, HDDS-1554.014.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current plan for fault injection disk tests are: > # Scenario 1 - Read/Write test > ## Run docker-compose to bring up a cluster > ## Initialize scm and om > ## Upload data to Ozone cluster > ## Verify data is correct > ## Shutdown cluster > # Scenario 2 - Read/Only test > ## Repeat Scenario 1 > ## Mount data disk as read only > ## Try to write data to Ozone cluster > ## Validate error message is correct > ## Shutdown cluster > # Scenario 3 - Corruption test > ## Repeat Scenario 2 > ## Shutdown cluster > ## Modify data disk data > ## Restart cluster > ## Validate error message for read from corrupted data > ## Validate error message for write to corrupted volume -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
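The reason `mvn test` does not pick these tests up is that integration tests are conventionally bound to the integration-test phase (typically via the maven-failsafe-plugin), which only runs as part of `mvn verify` or `mvn install`. A sketch of how such an "it" profile is commonly wired — illustrative only, not the actual Ozone pom:

```xml
<!-- Hypothetical "it" profile: failsafe runs *IT tests during the
     integration-test phase and fails the build during verify.
     Because "mvn test" stops before integration-test, these never run there. -->
<profile>
  <id>it</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-failsafe-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>integration-test</goal>
              <goal>verify</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```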
[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test
[ https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919754#comment-16919754 ] Eric Yang commented on HDFS-14461: -- [~ayushtkn] Some test enhancement is happening in HDFS-14609. The pre-requisites need to be addressed, then hadoop rbf project can tackle rbf Kerberos related test cases. > RBF: Fix intermittently failing kerberos related unit test > -- > > Key: HDFS-14461 > URL: https://issues.apache.org/jira/browse/HDFS-14461 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: He Xiaoqiao >Priority: Major > Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch > > > TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It > may be due to some race condition before using the keytab that's created for > testing. > > {code:java} > Failed > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken > Failing for the past 1 build (Since > [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! > #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] ) > [Took 89 > ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history] > > Error Message > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED > h3. 
Stacktrace > org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) > at > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at > 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > Caused by: org.apache.hadoop.security.KerberosAuthException: failure to > login: for principal: router/localh...@example.com from keytab >
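If the intermittent PREAUTH_FAILED is indeed a race against keytab creation, one defensive pattern for the test setup is to block until the keytab file exists and is non-empty before attempting the login. A self-contained sketch under that assumption (names illustrative; the real test would then proceed to the keytab login, e.g. via UserGroupInformation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative guard against the suspected race: poll until the test keytab
// has actually been written before the test tries to log in with it.
public class KeytabGuard {
    public static boolean waitForKeytab(Path keytab, long timeoutMs)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            // Non-empty is a cheap proxy for "the KDC helper finished writing".
            if (Files.exists(keytab) && Files.size(keytab) > 0) {
                return true;
            }
            Thread.sleep(100);
        }
        return false;
    }
}
```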
[jira] [Comment Edited] (HDDS-1554) Create disk tests for fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918264#comment-16918264 ] Eric Yang edited comment on HDDS-1554 at 8/29/19 4:04 AM: -- [~arp] The test is written to run by specifying the "it" profile. {code} mvn -T 1C clean install -DskipTests=true -Pdist -Dtar -DskipShade -Pit,docker-build -Ddocker.image=apache/ozone:0.5.0-SNAPSHOT{code} was (Author: eyang): [~arp] The test is written to run by specifying the "it" profile. {code} mvn -T 1C clean install -DskipTests=true -Pdist -Dtar -DskipShade -P,itdocker-build -Ddocker.image=apache/ozone:0.5.0-SNAPSHOT{code} > Create disk tests for fault injection test > -- > > Key: HDDS-1554 > URL: https://issues.apache.org/jira/browse/HDDS-1554 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: build >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, > HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, > HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, > HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch, > HDDS-1554.012.patch, HDDS-1554.013.patch, HDDS-1554.014.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current plan for fault injection disk tests are: > # Scenario 1 - Read/Write test > ## Run docker-compose to bring up a cluster > ## Initialize scm and om > ## Upload data to Ozone cluster > ## Verify data is correct > ## Shutdown cluster > # Scenario 2 - Read/Only test > ## Repeat Scenario 1 > ## Mount data disk as read only > ## Try to write data to Ozone cluster > ## Validate error message is correct > ## Shutdown cluster > # Scenario 3 - Corruption test > ## Repeat Scenario 2 > ## Shutdown cluster > ## Modify data disk data > ## Restart cluster > ## Validate error message for read from corrupted data > ## Validate error message for write to corrupted volume -- This 
message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-1554) Create disk tests for fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918264#comment-16918264 ] Eric Yang commented on HDDS-1554: - [~arp] The test is written to run by specifying the "it" profile. {code} mvn -T 1C clean install -DskipTests=true -Pdist -Dtar -DskipShade -P,itdocker-build -Ddocker.image=apache/ozone:0.5.0-SNAPSHOT{code} > Create disk tests for fault injection test > -- > > Key: HDDS-1554 > URL: https://issues.apache.org/jira/browse/HDDS-1554 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: build >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, > HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, > HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, > HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch, > HDDS-1554.012.patch, HDDS-1554.013.patch, HDDS-1554.014.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current plan for fault injection disk tests are: > # Scenario 1 - Read/Write test > ## Run docker-compose to bring up a cluster > ## Initialize scm and om > ## Upload data to Ozone cluster > ## Verify data is correct > ## Shutdown cluster > # Scenario 2 - Read/Only test > ## Repeat Scenario 1 > ## Mount data disk as read only > ## Try to write data to Ozone cluster > ## Validate error message is correct > ## Shutdown cluster > # Scenario 3 - Corruption test > ## Repeat Scenario 2 > ## Shutdown cluster > ## Modify data disk data > ## Restart cluster > ## Validate error message for read from corrupted data > ## Validate error message for write to corrupted volume -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916891#comment-16916891 ] Eric Yang commented on HDFS-2470: - [~arp] Sorry for the confusion. The patch is fine. Further analysis revealed that my cluster automation script was flawed. The patch is working fine with HBase. > NN should automatically set permissions on dfs.namenode.*.dir > - > > Key: HDFS-2470 > URL: https://issues.apache.org/jira/browse/HDFS-2470 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Aaron T. Myers >Assignee: Siddharth Wagle >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, > HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, > HDFS-2470.06.patch, HDFS-2470.07.patch, HDFS-2470.08.patch, HDFS-2470.09.patch > > > Much as the DN currently sets the correct permissions for the > dfs.datanode.data.dir, the NN should do the same for the > dfs.namenode.(name|edit).dir. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
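The behavior under discussion — the NameNode setting permissions on dfs.namenode.(name|edit).dir at startup, the way the DataNode does for dfs.datanode.data.dir — reduces at its core to something like the following sketch. Plain java.nio is used so it stands alone; class and method names are illustrative, not the actual StorageDirectory code, and this only works on POSIX filesystems.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Illustrative sketch: ensure a storage directory exists and force it to the
// configured permission (e.g. "rwx------" for 700) on startup.
public class StorageDirPermissions {
    public static void enforce(Path dir, String perm) throws IOException {
        Files.createDirectories(dir);
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(perm);
        // Overwrites whatever mode the directory currently has,
        // which is what can surprise co-located services.
        Files.setPosixFilePermissions(dir, perms);
    }
}
```

Such blanket enforcement is exactly the kind of change that can break co-located services that previously relied on more permissive directory modes.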
[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916216#comment-16916216 ] Eric Yang commented on HDFS-2470: - [~swagle] Thank you for patch 09. Unfortunately, this patch breaks HBase for some reason. HBase does not show the exact error, but fails to start the HBase RegionServer. It appears that an exception is thrown, but the error manifested in HBase as a ZooKeeper ACL exception: {code} 2019-08-26 14:45:42,597 WARN [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] client.ZooKeeperSaslClient: Could not login: the client is being asked for a password, but the Zookeeper client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)' and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this Zookeeper client using the command 'kinit ' (where is the name of the client's Kerberos principal). If the latter, do 'kinit -k -t ' (where is the name of the Kerberos principal, and is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host's clock is in sync with this host's clock. 2019-08-26 14:45:42,598 WARN [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] zookeeper.ClientCnxn: SASL configuration failed: javax.security.auth.login.LoginException: No password provided Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. 
2019-08-26 14:45:42,598 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server eyang-4.vpc.cloudera.com/10.65.53.170:2181 2019-08-26 14:45:42,598 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] zookeeper.ClientCnxn: Socket connection established to eyang-4.vpc.cloudera.com/10.65.53.170:2181, initiating session 2019-08-26 14:45:42,601 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] zookeeper.ClientCnxn: Session establishment complete on server eyang-4.vpc.cloudera.com/10.65.53.170:2181, sessionid = 0x200010a127c0070, negotiated timeout = 6 2019-08-26 14:45:45,659 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020] ipc.RpcServer: Stopping server on 16020 2019-08-26 14:45:45,659 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020] token.AuthenticationTokenSecretManager: Stopping leader election, because: SecretManager stopping 2019-08-26 14:45:45,660 INFO [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping 2019-08-26 14:45:45,660 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped 2019-08-26 14:45:45,660 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping 2019-08-26 14:45:45,660 FATAL [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020] regionserver.HRegionServer: ABORTING region server eyang-3.vpc.cloudera.com,16020,1566855941147: Initialization of RS failed. Hence aborting RS. java.io.IOException: Received the shutdown message while waiting. 
at org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:819) at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:772) at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:744) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:889) at java.lang.Thread.run(Thread.java:748) {code} When the patch is removed, HBase was able to start successfully. I dug pretty deep into the HBase source code, but StorageDirectory is not used in its code base. I validated that the DataNode directory default permission is not changed by patch 09. More study is required to understand the root cause of the incompatibility. > NN should automatically set permissions on dfs.namenode.*.dir > - > > Key: HDFS-2470 > URL: https://issues.apache.org/jira/browse/HDFS-2470 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Aaron T. Myers >Assignee:
[jira] [Comment Edited] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916216#comment-16916216 ] Eric Yang edited comment on HDFS-2470 at 8/26/19 11:15 PM: --- [~swagle] Thank you for patch 09, unfortunately, this patch breaks HBase for some reason. HBase does not show exact error, but fail to start HBase Region server. It appears that there is an exception thrown, but the error menifested in HBase as ZooKeeper ACL exception: {code} 2019-08-26 14:45:42,597 WARN [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] client.ZooKeeperSaslClient: Could not login: the client is being asked for a password, but the Zookeeper client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)' and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this Zookeeper client using the command 'kinit ' (where is the name of the client's Kerberos principal). If the latter, do 'kinit -k -t ' (where is the name of the Kerberos principal, and is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host's clock is in sync with this host's clock. 2019-08-26 14:45:42,598 WARN [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] zookeeper.ClientCnxn: SASL configuration failed: javax.security.auth.login.LoginException: No password provided Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. 
2019-08-26 14:45:42,598 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] zookeeper.ClientCnxn: Opening socket connection to server eyang-4.vpc.cloudera.com/10.65.53.170:2181 2019-08-26 14:45:42,598 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] zookeeper.ClientCnxn: Socket connection established to eyang-4.vpc.cloudera.com/10.65.53.170:2181, initiating session 2019-08-26 14:45:42,601 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020-SendThread(eyang-4.vpc.cloudera.com:2181)] zookeeper.ClientCnxn: Session establishment complete on server eyang-4.vpc.cloudera.com/10.65.53.170:2181, sessionid = 0x200010a127c0070, negotiated timeout = 6 2019-08-26 14:45:45,659 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020] ipc.RpcServer: Stopping server on 16020 2019-08-26 14:45:45,659 INFO [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020] token.AuthenticationTokenSecretManager: Stopping leader election, because: SecretManager stopping 2019-08-26 14:45:45,660 INFO [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping 2019-08-26 14:45:45,660 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped 2019-08-26 14:45:45,660 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping 2019-08-26 14:45:45,660 FATAL [regionserver/eyang-3.vpc.cloudera.com/10.65.52.68:16020] regionserver.HRegionServer: ABORTING region server eyang-3.vpc.cloudera.com,16020,1566855941147: Initialization of RS failed. Hence aborting RS. java.io.IOException: Received the shutdown message while waiting. 
at org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:819) at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:772) at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:744) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:889) at java.lang.Thread.run(Thread.java:748) {code} When the patch is removed, HBase is able to start successfully. I dug pretty deep into the HBase source code, but StorageDirectory is not used in that code base. I validated that the Datanode directory default permission is not changed by patch 09. More study is required to understand the root cause of the incompatibility.
[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911779#comment-16911779 ] Eric Yang commented on HDFS-2470: - [~swagle] Thank you for the patch. Patch 08 looks good to me. Pending Jenkins results. > NN should automatically set permissions on dfs.namenode.*.dir > - > > Key: HDFS-2470 > URL: https://issues.apache.org/jira/browse/HDFS-2470 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Aaron T. Myers >Assignee: Siddharth Wagle >Priority: Major > Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, > HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, > HDFS-2470.06.patch, HDFS-2470.07.patch, HDFS-2470.08.patch > > > Much as the DN currently sets the correct permissions for the > dfs.datanode.data.dir, the NN should do the same for the > dfs.namenode.(name|edit).dir. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908481#comment-16908481 ] Eric Yang commented on HDFS-2470: - [~swagle] Thank you for patch 07. I think the right fix is not setting the root directory permission. HDFS is smart about creating subdirectories under the root directory. The root directory is a system-admin-defined location, and the provisioning system should initialize it with proper ownership and permissions. Without setting the root directory permission, the solution is more generic and works for both /tmp and /tmp/namenode. Would it be safer to pass in a default permission of 0700 instead of null for the constructors that do not accept a permission parameter? In the past, files and directories were created based on the user's umask. This causes all files to be readable by anyone on a standard Linux installation. For HDFS, the hdfs user would want to keep all data private, unless explicitly required by very old versions of short-circuit read. Hence, it might be useful to pass a default permission and skip the null check, to ensure data is secured by default unless explicitly allowed. > NN should automatically set permissions on dfs.namenode.*.dir > - > > Key: HDFS-2470 > URL: https://issues.apache.org/jira/browse/HDFS-2470 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Aaron T. Myers >Assignee: Siddharth Wagle >Priority: Major > Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, > HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, > HDFS-2470.06.patch, HDFS-2470.07.patch > > > Much as the DN currently sets the correct permissions for the > dfs.datanode.data.dir, the NN should do the same for the > dfs.namenode.(name|edit).dir. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14375) DataNode cannot serve BlockPool to multiple NameNodes in the different realm
[ https://issues.apache.org/jira/browse/HDFS-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908389#comment-16908389 ] Eric Yang commented on HDFS-14375: -- [~Jihyun.Cho] The first log line indicates that the IPC Server authenticated dn/testhost1@test1.com to access a Datanode running as dn/testhost1@test2.com. The problem is the second log line, in ServiceAuthorizationManager. It looks like a wrong optimization that happened a long time ago [on this line|https://github.com/apache/hadoop/blame/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/authorize/ServiceAuthorizationManager.java#L120]. The original code was comparing the [short username|https://github.com/apache/hadoop/commit/c3fdd289cf26fa3bb9c0d2d9f906eba769ddd789#diff-90193e5349be2122d5ed915ba38c957dL123]. The original code ensures dn/testhost1@test1.com and dn/testhost2@test2.com can both map to the same user via auth_to_local rules. The current implementation compares the raw principals, which skips auth_to_local rule mapping and fails authorization incorrectly. > DataNode cannot serve BlockPool to multiple NameNodes in the different realm > > > Key: HDFS-14375 > URL: https://issues.apache.org/jira/browse/HDFS-14375 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 3.1.1 >Reporter: Jihyun Cho >Assignee: Jihyun Cho >Priority: Major > Attachments: authorize.patch > > > Let me explain the environment for a description. > {noformat} > KDC(TEST1.COM) <-- Cross-realm trust --> KDC(TEST2.COM) >| | > NameNode1 NameNode2 >| | >-- DataNodes (federated) -- > {noformat} > We configured the secure clusters and federated them. > * Principal > ** NameNode1 : nn/_h...@test1.com > ** NameNode2 : nn/_h...@test2.com > ** DataNodes : dn/_h...@test2.com > But DataNodes could not connect to NameNode1 with below error. 
> {noformat} > WARN > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization failed for dn/hadoop-datanode.test@test2.com > (auth:KERBEROS) for protocol=interface > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol: this service is only > accessible by dn/hadoop-datanode.test@test1.com > {noformat} > We have avoided the error with attached patch. > The patch checks only using {{username}} and {{hostname}} except {{realm}}. > I think there is no problem. Because if realms are different and no > cross-realm setting, they cannot communication each other. If you are worried > about this, please let me know. > In the long run, it would be better if I could set multiple realms for > authorize. Like this; > {noformat} > > dfs.namenode.kerberos.trust-realms > TEST1.COM,TEST2.COM > > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
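The auth_to_local point above can be illustrated with a small, self-contained sketch. This is not Hadoop's actual KerberosName/ServiceAuthorizationManager code; the rule pattern, class, and method names below are hypothetical, standing in for an auth_to_local rule that maps `dn/<host>@<trusted realm>` down to the short name "dn" before the authorization compare:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShortNameAuthz {
    // Illustrative stand-in for an auth_to_local rule: map any dn/<host>
    // principal from a trusted realm to the short user name "dn".
    static final Pattern DN_RULE =
            Pattern.compile("^dn/[^@]+@(TEST1\\.COM|TEST2\\.COM)$");

    static String toShortName(String principal) {
        Matcher m = DN_RULE.matcher(principal);
        // Fall through unchanged when no rule matches, as auth_to_local does
        // with its DEFAULT rule for the local realm.
        return m.matches() ? "dn" : principal;
    }

    public static void main(String[] args) {
        String a = "dn/testhost1@TEST1.COM";
        String b = "dn/testhost2@TEST2.COM";
        // Comparing raw principals fails across realms...
        System.out.println(a.equals(b));                            // false
        // ...while comparing mapped short names succeeds.
        System.out.println(toShortName(a).equals(toShortName(b)));  // true
    }
}
```

This is the behavioral difference described above: the raw-principal compare rejects a datanode from the trusted realm, while the short-name compare accepts it.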
[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907425#comment-16907425 ] Eric Yang commented on HDFS-2470: - [~swagle] Defaulting to 700 is generally a good idea. StorageDirectory is also used by the datanode, and there is a [legacy version of HDFS short-circuit|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html] read that allows the datanode storage directory to be controlled using the dfs.datanode.data.dir.perm config. By using a 700 default, it may create an incompatible change for applications that depend on the legacy HDFS short-circuit read defaults. Directory permission can default to 700 once the code checks all permission configs against dirType to ensure we don't regress. > NN should automatically set permissions on dfs.namenode.*.dir > - > > Key: HDFS-2470 > URL: https://issues.apache.org/jira/browse/HDFS-2470 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Aaron T. Myers >Assignee: Siddharth Wagle >Priority: Major > Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, > HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, HDFS-2470.06.patch > > > Much as the DN currently sets the correct permissions for the > dfs.datanode.data.dir, the NN should do the same for the > dfs.namenode.(name|edit).dir. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906523#comment-16906523 ] Eric Yang commented on HDFS-2470: - Thank you [~swagle] for the patch. The Namenode directory is created with 700 permission; however, I think there are still bugs in the implementation. A few questions about patch 006: in hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/common/Storage.java:
{code}
+ if (permission != null) {
+   Set<PosixFilePermission> permissions = EnumSet.of(OWNER_READ,
+       OWNER_WRITE, OWNER_EXECUTE);
+   Files.setPosixFilePermissions(root.toPath(), permissions);
+   Files.setPosixFilePermissions(curDir.toPath(), permissions);
+ }
{code}
# It looks like when the permission variable is passed in, the hard-coded "permissions" set is used anyway. This logic doesn't seem right. I think you want to map the numeric value of the permission variable to PosixFilePermission enums. # Can we avoid passing null as a parameter to the StorageDirectory method? If it has not been defined, would it be possible to compute the default permission (dfs.*.storage.dir.perm) from dirType? > NN should automatically set permissions on dfs.namenode.*.dir > - > > Key: HDFS-2470 > URL: https://issues.apache.org/jira/browse/HDFS-2470 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Aaron T. Myers >Assignee: Siddharth Wagle >Priority: Major > Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, > HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, HDFS-2470.06.patch > > > Much as the DN currently sets the correct permissions for the > dfs.datanode.data.dir, the NN should do the same for the > dfs.namenode.(name|edit).dir. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
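Mapping the numeric permission value to PosixFilePermission enums, as suggested in the comment above, could look like the minimal sketch below. The helper name `fromOctal` and its placement are illustrative, not from the patch; it converts an octal mode string such as "700" into the java.nio permission set that Files.setPosixFilePermissions accepts:

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class ModeToPosix {
    // Hypothetical helper: translate each octal digit (owner/group/other)
    // into its rwx triple, then parse the 9-char string into the enum set.
    static Set<PosixFilePermission> fromOctal(String mode) {
        StringBuilder sb = new StringBuilder(9);
        for (char c : mode.toCharArray()) {
            int bits = c - '0';
            sb.append((bits & 4) != 0 ? 'r' : '-');
            sb.append((bits & 2) != 0 ? 'w' : '-');
            sb.append((bits & 1) != 0 ? 'x' : '-');
        }
        return PosixFilePermissions.fromString(sb.toString());
    }

    public static void main(String[] args) {
        // "700" becomes the owner-only rwx set.
        System.out.println(PosixFilePermissions.toString(fromOctal("700"))); // rwx------
    }
}
```

A set produced this way could replace the hard-coded EnumSet in the patch, so the configured value is honored instead of being ignored.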
[jira] [Commented] (HDFS-14375) DataNode cannot serve BlockPool to multiple NameNodes in the different realm
[ https://issues.apache.org/jira/browse/HDFS-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905573#comment-16905573 ] Eric Yang commented on HDFS-14375: -- {quote}I think the main issue is DataNode only authorize its own realm, even if the realms are set cross-realm trust. To solve this issue, clientPrincipal should be checked multiple cross-realms in authorize method. {quote} The authorize method looks into [krbInfo to find the hostname from the service principal to find a match|https://github.com/apache/hadoop/blame/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/authorize/ServiceAuthorizationManager.java#L109]. If a client accesses the datanode and passes authentication negotiation, the client's ticket cache will contain the datanode hostname. Hadoop code does not inspect the realm part of the principal name in the authorize method; it merely validates that the client ticket cache contains the hostname of the datanode. One way to validate cross-realm authentication is to look at the klist output and make sure that:
{code:java}
klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs-d...@example.com

Valid starting       Expires              Service principal
08/12/2019 19:28:17  08/13/2019 19:28:17  krbtgt/example@example.com
    renew until 08/19/2019 19:28:17
08/12/2019 20:37:49  08/13/2019 19:28:17  HTTP/datanode.example2@example2.com
    renew until 08/19/2019 19:28:17
{code}
In this example, the ticket cache contains the user's own krbtgt and also a granted service principal for a host in a different realm. > DataNode cannot serve BlockPool to multiple NameNodes in the different realm > > > Key: HDFS-14375 > URL: https://issues.apache.org/jira/browse/HDFS-14375 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 3.1.1 >Reporter: Jihyun Cho >Assignee: Jihyun Cho >Priority: Major > Attachments: authorize.patch > > > Let me explain the environment for a description. 
> {noformat} > KDC(TEST1.COM) <-- Cross-realm trust --> KDC(TEST2.COM) >| | > NameNode1 NameNode2 >| | >-- DataNodes (federated) -- > {noformat} > We configured the secure clusters and federated them. > * Principal > ** NameNode1 : nn/_h...@test1.com > ** NameNode2 : nn/_h...@test2.com > ** DataNodes : dn/_h...@test2.com > But DataNodes could not connect to NameNode1 with below error. > {noformat} > WARN > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization failed for dn/hadoop-datanode.test@test2.com > (auth:KERBEROS) for protocol=interface > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol: this service is only > accessible by dn/hadoop-datanode.test@test1.com > {noformat} > We have avoided the error with attached patch. > The patch checks only using {{username}} and {{hostname}} except {{realm}}. > I think there is no problem. Because if realms are different and no > cross-realm setting, they cannot communication each other. If you are worried > about this, please let me know. > In the long run, it would be better if I could set multiple realms for > authorize. Like this; > {noformat} > > dfs.namenode.kerberos.trust-realms > TEST1.COM,TEST2.COM > > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14375) DataNode cannot serve BlockPool to multiple NameNodes in the different realm
[ https://issues.apache.org/jira/browse/HDFS-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904232#comment-16904232 ] Eric Yang commented on HDFS-14375: -- This looks like a configuration issue in the KDC server for performing cross-realm trust. Please verify that the krbtgt/test1@test2.com principal has been added for cross-realm trust to work, and vice versa for bi-directional trust. You will also need to make sure Hadoop's auth_to_local maps the remote realm to the same dn user. UserGroupInformation.getShortName() should be invoked to resolve the user name instead of manually parsing the principal name. Otherwise, auth_to_local rules are skipped, and losing hierarchical information often results in privilege-escalation security holes. > DataNode cannot serve BlockPool to multiple NameNodes in the different realm > > > Key: HDFS-14375 > URL: https://issues.apache.org/jira/browse/HDFS-14375 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 3.1.1 >Reporter: Jihyun Cho >Assignee: Jihyun Cho >Priority: Major > Attachments: authorize.patch > > > Let me explain the environment for a description. > {noformat} > KDC(TEST1.COM) <-- Cross-realm trust --> KDC(TEST2.COM) >| | > NameNode1 NameNode2 >| | >-- DataNodes (federated) -- > {noformat} > We configured the secure clusters and federated them. > * Principal > ** NameNode1 : nn/_h...@test1.com > ** NameNode2 : nn/_h...@test2.com > ** DataNodes : dn/_h...@test2.com > But DataNodes could not connect to NameNode1 with below error. > {noformat} > WARN > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization failed for dn/hadoop-datanode.test@test2.com > (auth:KERBEROS) for protocol=interface > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol: this service is only > accessible by dn/hadoop-datanode.test@test1.com > {noformat} > We have avoided the error with attached patch. 
> The patch checks only using {{username}} and {{hostname}} except {{realm}}. > I think there is no problem. Because if realms are different and no > cross-realm setting, they cannot communication each other. If you are worried > about this, please let me know. > In the long run, it would be better if I could set multiple realms for > authorize. Like this; > {noformat} > > dfs.namenode.kerberos.trust-realms > TEST1.COM,TEST2.COM > > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-1554) Create disk tests for fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1554: Attachment: HDDS-1554.014.patch > Create disk tests for fault injection test > -- > > Key: HDDS-1554 > URL: https://issues.apache.org/jira/browse/HDDS-1554 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: build >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, > HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, > HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, > HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch, > HDDS-1554.012.patch, HDDS-1554.013.patch, HDDS-1554.014.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current plan for fault injection disk tests are: > # Scenario 1 - Read/Write test > ## Run docker-compose to bring up a cluster > ## Initialize scm and om > ## Upload data to Ozone cluster > ## Verify data is correct > ## Shutdown cluster > # Scenario 2 - Read/Only test > ## Repeat Scenario 1 > ## Mount data disk as read only > ## Try to write data to Ozone cluster > ## Validate error message is correct > ## Shutdown cluster > # Scenario 3 - Corruption test > ## Repeat Scenario 2 > ## Shutdown cluster > ## Modify data disk data > ## Restart cluster > ## Validate error message for read from corrupted data > ## Validate error message for write to corrupted volume -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-1554) Create disk tests for fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904077#comment-16904077 ] Eric Yang commented on HDDS-1554: - [~arp] Thank you for the review. {quote}ITDiskReadOnly#testReadOnlyDiskStartup - The following block of code can probably be removed, since it's really testing that the cluster is read-only in safe mode. We have unit tests for that: {quote} Correct me if I am wrong, but the tests are not exactly the same. This test triggers validation from the Ozone client's point of view. The unit test TestVolumeSet#testFailedVolume is written for the server side. The smoke test covers the positive case, ensuring a volume can be created, but not the case where the disk is in read-only mode. I think there is value in testing the client-side response to ensure better coverage. Thoughts? {quote}ITDiskReadOnly#testUpload - do we need to wait for safe mode exit after restarting the cluster? Also I think this test is essentially the same as the previous one. {quote} Safe mode validation is skipped here because Ozone exits on a read-only disk, so the extra wait would be a formality. In reality, it would be better to keep the Ozone daemon running but keep the file system in a safe or degraded mode that prevents write operations. This would be useful for disaster recovery, where a system admin may want to prevent further damage to the disk while still recovering data from Ozone buckets. This test is designed to pass both for a running read-only mode and for the exit-on-error strategy. Both designs are valid, but the test is more useful if Ozone daemons don't exit on a read-only disk. I intend to add a download test for ITDiskReadOnly as well, if read-only mode can be implemented. {quote}ITDiskCorruption#addCorruption:72 - looks like we have a hard-coded path. Should we get from configuration instead? {quote} Thank you for the suggestion. I made an adjustment in patch 014 to ensure the Maven project build directory can be customized. 
The test uses ${buildDirectory}/data/meta to store metadata, which defaults to Maven's ${project.build.directory}, and corrupts the data files there. Placing the data files in the Maven build directory is a good way to ensure that mvn clean resets their state cleanly. When this is configured externally, an external mechanism must be developed to reset the data file state. {quote}ITDiskCorruption#testUpload - The corruption implementation is bit of a heavy hammer, it is replacing the content of all meta files. Is it possible to make it reflect real-world corruption where a part of the file may be corrupted. Also we should probably restart the cluster after corrupting RocksDB meta files. {quote} If Ozone is restarted after metadata corruption, it falls into the same code path that is unable to open RocksDB and fails to start. This would make the corruption upload test execute the same code path as ITDiskReadOnly#testReadOnlyDiskStartup, and the test would have no purpose. The test purposefully corrupts metadata files without a restart, to ensure a safety mechanism will be built to protect metadata integrity. One possible design is a background thread that checks RocksDB health. In the test, we can shorten the check interval to almost immediate, to verify that the upload does not succeed when metadata corruption happens and that Ozone prevents further corruption by entering safe or degraded mode. {quote}ITDiskCorruption#testDownload:161 - should we just remove the assertTrue since it is no-op? {quote} The intent is to ensure an IOException is thrown for the test assertion to pass. It is better written for clarity: {code:java} Assert.assertTrue("Download File test passed.", e instanceof IOException); {code} Patch 014 also includes the improved assertTrue statements. 
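A cleaner pattern than catching broadly and asserting on the exception type afterward is a small expected-exception helper. The sketch below is plain, self-contained Java (not the actual HDDS test code, which uses JUnit's Assert); the class and method names are illustrative:

```java
import java.io.IOException;

public class ExpectIOException {
    interface ThrowingRunnable { void run() throws Exception; }

    // Run the operation and return the expected IOException; fail loudly if
    // nothing is thrown or a different exception type escapes.
    static IOException expectIOException(ThrowingRunnable op) {
        try {
            op.run();
        } catch (IOException e) {
            return e; // the expected failure path
        } catch (Exception e) {
            throw new AssertionError("wrong exception type: " + e, e);
        }
        throw new AssertionError("expected IOException but none was thrown");
    }

    public static void main(String[] args) {
        IOException e = expectIOException(() -> {
            throw new IOException("disk corrupted"); // stand-in for the download call
        });
        System.out.println(e.getMessage()); // disk corrupted
    }
}
```

With JUnit this same intent can be expressed with ExpectedException rules or assertThrows, which make the "no exception thrown" case fail automatically.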
> Create disk tests for fault injection test > -- > > Key: HDDS-1554 > URL: https://issues.apache.org/jira/browse/HDDS-1554 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: build >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, > HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, > HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, > HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch, > HDDS-1554.012.patch, HDDS-1554.013.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current plan for fault injection disk tests are: > # Scenario 1 - Read/Write test > ## Run docker-compose to bring up a cluster > ## Initialize scm and om > ## Upload data to Ozone cluster > ## Verify data is correct > ## Shutdown cluster > # Scenario 2 -
[jira] [Commented] (HDDS-1554) Create disk tests for fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903436#comment-16903436 ] Eric Yang commented on HDDS-1554: - [~arp] The tests are written to run in integration phase, try: {code} mvn verify -Pit,docker-build {code} > Create disk tests for fault injection test > -- > > Key: HDDS-1554 > URL: https://issues.apache.org/jira/browse/HDDS-1554 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: build >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, > HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, > HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, > HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch, > HDDS-1554.012.patch, HDDS-1554.013.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > The current plan for fault injection disk tests are: > # Scenario 1 - Read/Write test > ## Run docker-compose to bring up a cluster > ## Initialize scm and om > ## Upload data to Ozone cluster > ## Verify data is correct > ## Shutdown cluster > # Scenario 2 - Read/Only test > ## Repeat Scenario 1 > ## Mount data disk as read only > ## Try to write data to Ozone cluster > ## Validate error message is correct > ## Shutdown cluster > # Scenario 3 - Corruption test > ## Repeat Scenario 2 > ## Shutdown cluster > ## Modify data disk data > ## Restart cluster > ## Validate error message for read from corrupted data > ## Validate error message for write to corrupted volume -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir
[ https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900435#comment-16900435 ] Eric Yang commented on HDFS-2470: - [~swagle] Thank you for the patch. 1. The File API is riddled with misbehavior for serious filesystem work. For creating directories and setting permissions on the newly created directories, use the [Files|https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html] API instead. {code}
import static java.nio.file.attribute.PosixFilePermission.OWNER_READ;
import static java.nio.file.attribute.PosixFilePermission.OWNER_WRITE;
...
Set<PosixFilePermission> permissions = EnumSet.of(OWNER_READ, OWNER_WRITE);
Files.createDirectory(Paths.get(curDir),
    PosixFilePermissions.asFileAttribute(permissions));
{code} I am not sure about setting the root permission of the working directory. The directory could be /tmp/namenode, and we could accidentally make /tmp readable/writable only by the hdfs user and fail. 2. javax.annotation.Nullable is a problematic annotation. Findbugs uses this annotation, but it prevents code from working with signed content on JDK 9. See HADOOP-16463 for details. It would be nice to use findbugsExcludeFile.xml to declare the variable as nullable. > NN should automatically set permissions on dfs.namenode.*.dir > - > > Key: HDFS-2470 > URL: https://issues.apache.org/jira/browse/HDFS-2470 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Aaron T. Myers >Assignee: Siddharth Wagle >Priority: Major > Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, > HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch > > > Much as the DN currently sets the correct permissions for the > dfs.datanode.data.dir, the NN should do the same for the > dfs.namenode.(name|edit).dir. 
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
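The Files-API suggestion above can be sketched end to end. This is an illustrative, self-contained example, not the Storage.java patch code; the class and helper names are hypothetical, and the path is a temp directory rather than an HDFS storage dir:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class CreateStorageDir {
    // Create a directory with owner-only (700) permissions. The attribute on
    // createDirectory requests the mode at creation time, but the OS may still
    // apply the process umask; the explicit setPosixFilePermissions afterwards
    // guarantees the exact mode, like chmod 700.
    static Path createOwnerOnlyDir(Path dir) throws IOException {
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rwx------");
        Files.createDirectory(dir, PosixFilePermissions.asFileAttribute(perms));
        Files.setPosixFilePermissions(dir, perms);
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Path cur = createOwnerOnlyDir(
                Files.createTempDirectory("storage").resolve("current"));
        // Prints rwx------ on POSIX systems.
        System.out.println(PosixFilePermissions.toString(
                Files.getPosixFilePermissions(cur)));
    }
}
```

Note that this creates only the leaf directory with the restricted mode, which avoids the concern above about accidentally tightening permissions on a shared parent like /tmp.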
[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test
[ https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896267#comment-16896267 ] Eric Yang commented on HDFS-14461: -- [~hexiaoqiao] The test cases fail on my system the same as Jenkins reported. Please make sure that your .m2 Maven cache is cleared to ensure your test results are accurate. TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal tests an invalid SPNEGO principal setup by unsetting the dfs.web.authentication.kerberos.keytab configuration. The test case can be updated to look for hadoop.http.authentication.kerberos.principal, because SecurityConfUtil has been updated to use the globally consistent configuration for referencing the SPNEGO keytab setup. TestRouterFaultTolerant#testWriteWithFailedSubcluster also fails because the test case is written for simple security. SecurityConfUtil turns on SPNEGO authentication for the HTTP protocol when this patch is applied. This causes the client to be unable to talk to the namenode to get block locations if it does not send an authentication negotiation header. > RBF: Fix intermittently failing kerberos related unit test > -- > > Key: HDFS-14461 > URL: https://issues.apache.org/jira/browse/HDFS-14461 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: He Xiaoqiao >Priority: Major > Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch > > > TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It > may be due to some race condition before using the keytab that's created for > testing. > > {code:java} > Failed > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken > Failing for the past 1 build (Since > [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] ) > [Took 89 > ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history] > > Error Message > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED > h3. Stacktrace > org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) > at > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at >
[jira] [Updated] (HDDS-1833) RefCountedDB printing of stacktrace should be moved to trace logging
[ https://issues.apache.org/jira/browse/HDDS-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1833: Resolution: Fixed Fix Version/s: 0.5.0 Status: Resolved (was: Patch Available) > RefCountedDB printing of stacktrace should be moved to trace logging > > > Key: HDDS-1833 > URL: https://issues.apache.org/jira/browse/HDDS-1833 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: newbie > Fix For: 0.5.0 > > Attachments: HDDS-1833.01.patch, HDDS-1833.02.patch, > HDDS-1833.03.patch, HDDS-1833.04.patch > > > RefCountedDB logs the stackTrace for both increment and decrement, this > pollutes the logs. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-1833) RefCountedDB printing of stacktrace should be moved to trace logging
[ https://issues.apache.org/jira/browse/HDDS-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895395#comment-16895395 ] Eric Yang commented on HDDS-1833: - +1 Thank you [~swagle] for the patch. Patch 004 looks good to me. Committing shortly. > RefCountedDB printing of stacktrace should be moved to trace logging > > > Key: HDDS-1833 > URL: https://issues.apache.org/jira/browse/HDDS-1833 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: newbie > Attachments: HDDS-1833.01.patch, HDDS-1833.02.patch, > HDDS-1833.03.patch, HDDS-1833.04.patch > > > RefCountedDB logs the stackTrace for both increment and decrement, this > pollutes the logs.
[jira] [Commented] (HDDS-1833) RefCountedDB printing of stacktrace should be moved to trace logging
[ https://issues.apache.org/jira/browse/HDDS-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894172#comment-16894172 ] Eric Yang commented on HDDS-1833: - [~swagle] Sorry, I don't think that is true. From the [Java spec|https://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.12.4]: {quote} Example 15.12.4.1-2. Evaluation Order During Method Invocation As part of an instance method invocation (§15.12), there is an expression that denotes the object to be invoked. This expression appears to be fully evaluated before any part of any argument expression to the method invocation is evaluated.{quote} ExceptionUtils.getStackTrace() is fully evaluated before the trace method is invoked. > RefCountedDB printing of stacktrace should be moved to trace logging > > > Key: HDDS-1833 > URL: https://issues.apache.org/jira/browse/HDDS-1833 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: newbie > Attachments: HDDS-1833.01.patch, HDDS-1833.02.patch, > HDDS-1833.03.patch > > > RefCountedDB logs the stackTrace for both increment and decrement, this > pollutes the logs.
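The evaluation-order point can be demonstrated with a small self-contained sketch. The `trace` stub below stands in for a logger call with trace disabled; `EvalOrder` and its members are illustrative names, not Ozone code:

```java
// Demonstrates JLS 15.12.4: argument expressions are evaluated before the
// method is invoked, so an expensive argument like
// ExceptionUtils.getStackTrace(new Throwable()) is computed even when the
// logger discards the message.
public class EvalOrder {
    static boolean argEvaluated = false;

    static String expensiveArg() {
        argEvaluated = true; // side effect proves the argument ran
        return "stack trace text";
    }

    // Stand-in for LOG.trace() on a logger with trace logging disabled.
    static void trace(String msg) {
        // message dropped; nothing is logged
    }

    public static void main(String[] args) {
        trace(expensiveArg());
        System.out.println(argEvaluated);
    }
}
```

Running this prints `true`: the expensive argument is evaluated even though the message is never logged, which is why a guard around the call can still pay off for hot paths.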
[jira] [Commented] (HDFS-13734) Add Heapsize variables for HDFS daemons
[ https://issues.apache.org/jira/browse/HDFS-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894069#comment-16894069 ] Eric Yang commented on HDFS-13734: -- [~bdscheller] Sorry, I agree with [~aw]. HDFS_*_OPTS is preferred for a number of reasons, such as the -Xms setting, GC policy flags, etc. Setting only the -Xmx flag without tuning the other flags may create problems for novice users and complicates config management. YARN_*_HEAPSIZE are not good examples to follow. > Add Heapsize variables for HDFS daemons > --- > > Key: HDFS-13734 > URL: https://issues.apache.org/jira/browse/HDFS-13734 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, journal-node, namenode >Affects Versions: 3.0.3 >Reporter: Brandon Scheller >Priority: Major > > Currently there are no variables to set HDFS daemon heapsize differently. > While still possible through adding the -Xmx to HDFS_*DAEMON*_OPTS, this is > not intuitive for this relatively common setting. > YARN currently has these separate YARN_*DAEMON*_HEAPSIZE variables supported > so it seems natural for HDFS too. > It also looks like HDFS used to have this for namenode with > HADOOP_NAMENODE_INIT_HEAPSIZE > This JIRA is to have these configurations added/supported
[jira] [Comment Edited] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test
[ https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893294#comment-16893294 ] Eric Yang edited comment on HDFS-14461 at 7/26/19 3:43 AM: --- [~elgoiri] I think it is premature to start using PRs. I have outlined a number of shortcomings of using PRs in the dev mailing list. We may want to wait for some of the outstanding issues to close before recommending PRs. [~hexiaoqiao] {quote} 1. is there any other way to wait keys until persisted rather than Thread.sleep(1000)? {quote} {code} while (!file.exists()) {} {code} But it would be nicer if you did it the way that [~crh] suggested. {quote} 2. do we need to define configuration item `hadoop.http.authentication.*` at CommonConfigurationKeys? {quote} I think it goes to: CommonConfigurationKeysPublic.java. Some keys are already there. I think it's nice but optional. {quote} 3. I am confused how TestRouterHttpDelegationToken test passed when pre-commit HDFS-13972? {quote} I think it was testing with anonymous allowed which implicitly passed through, but I can't be sure. {quote} 4. it seems that NoAuthFilter is not effective anymore, and I try to delete it. {quote} Ok was (Author: eyang): [~elgoiri] I think it is premature to start using PRs. I have outlined a number of shortcomings of using PRs in the dev mailing list. We may want to wait for some of the outstanding issues to close before recommending PRs. [~hexiaoqiao] {quote} 1. is there any other way to wait keys until persisted rather than Thread.sleep(1000)? {quote} {code} while (!file.exists()) {} {code} {quote} 2. do we need to define configuration item `hadoop.http.authentication.*` at CommonConfigurationKeys? {quote} I think it goes to: CommonConfigurationKeysPublic.java. Some keys are already there. I think it's nice but optional. {quote} 3. I am confused how TestRouterHttpDelegationToken test passed when pre-commit HDFS-13972? 
{quote} I think it was testing with anonymous allowed which implicitly passed through, but I can't be sure. {quote} 4. it seems that NoAuthFilter is not effective anymore, and I try to delete it. {quote} Ok > RBF: Fix intermittently failing kerberos related unit test > -- > > Key: HDFS-14461 > URL: https://issues.apache.org/jira/browse/HDFS-14461 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: He Xiaoqiao >Priority: Major > Attachments: HDFS-14461.001.patch > > > TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It > may be due to some race condition before using the keytab that's created for > testing. > > {code:java} > Failed > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken > Failing for the past 1 build (Since > [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! > #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] ) > [Took 89 > ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history] > > Error Message > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED > h3. 
Stacktrace > org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) > at > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at >
[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test
[ https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893294#comment-16893294 ] Eric Yang commented on HDFS-14461: -- [~elgoiri] I think it is premature to start using PRs. I have outlined a number of shortcomings of using PRs in the dev mailing list. We may want to wait for some of the outstanding issues to close before recommending PRs. [~hexiaoqiao] {quote} 1. is there any other way to wait keys until persisted rather than Thread.sleep(1000)? {quote} {code} while (!file.exists()) {} {code} {quote} 2. do we need to define configuration item `hadoop.http.authentication.*` at CommonConfigurationKeys? {quote} I think it goes to: CommonConfigurationKeysPublic.java. Some keys are already there. I think it's nice but optional. {quote} 3. I am confused how TestRouterHttpDelegationToken test passed when pre-commit HDFS-13972? {quote} I think it was testing with anonymous allowed which implicitly passed through, but I can't be sure. {quote} 4. it seems that NoAuthFilter is not effective anymore, and I try to delete it. {quote} Ok > RBF: Fix intermittently failing kerberos related unit test > -- > > Key: HDFS-14461 > URL: https://issues.apache.org/jira/browse/HDFS-14461 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: He Xiaoqiao >Priority: Major > Attachments: HDFS-14461.001.patch > > > TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It > may be due to some race condition before using the keytab that's created for > testing. > > {code:java} > Failed > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken > Failing for the past 1 build (Since > [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] ) > [Took 89 > ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history] > > Error Message > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED > h3. Stacktrace > org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) > at > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at >
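The busy-wait suggested for question 1 above ({{while (!file.exists()) {}}}) spins a CPU core and never gives up. A minimal sketch of a bounded polling wait is below, assuming a short sleep between checks is acceptable in test code; {{KeytabWait}} and {{waitForFile}} are illustrative names, not Hadoop code:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative test helper, not Hadoop code: poll for a file (e.g. a freshly
// written keytab) with a bounded timeout instead of spinning or a fixed sleep.
public class KeytabWait {
    /** Returns true once the file exists, false if the timeout elapses. */
    static boolean waitForFile(Path file, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!Files.exists(file)) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(50); // back off briefly between checks
        }
        return true;
    }
}
```

Compared with a fixed {{Thread.sleep(1000)}}, this returns as soon as the file appears and fails fast with a clear boolean when it never does.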
[jira] [Commented] (HDDS-1833) RefCountedDB printing of stacktrace should be moved to trace logging
[ https://issues.apache.org/jira/browse/HDDS-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893208#comment-16893208 ] Eric Yang commented on HDDS-1833: - [~swagle] Thank you for the patch. Generating the full stack trace may take more compute cycles. If this is a frequently called API, I would recommend keeping the if statement so that the stack trace computation is performed only when trace is turned on. If this is not a frequently called API, patch 3 looks good to me. > RefCountedDB printing of stacktrace should be moved to trace logging > > > Key: HDDS-1833 > URL: https://issues.apache.org/jira/browse/HDDS-1833 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: newbie > Attachments: HDDS-1833.01.patch, HDDS-1833.02.patch, > HDDS-1833.03.patch > > > RefCountedDB logs the stackTrace for both increment and decrement, this > pollutes the logs.
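The guard pattern being discussed can be sketched self-contained, with a boolean flag standing in for {{LOG.isTraceEnabled()}} and a counter showing that the expensive stack-trace capture is skipped when trace is off; {{TraceGuard}} and its members are illustrative names, not Ozone code:

```java
// Illustrative stand-in for guarding an expensive log argument with an
// isTraceEnabled()-style check, so the stack trace is only captured and
// formatted when trace logging is actually on.
public class TraceGuard {
    static boolean traceEnabled = false; // stand-in for LOG.isTraceEnabled()
    static int captures = 0;             // counts expensive computations

    static String captureStack() {
        captures++; // this is the "compute cycles" cost being avoided
        StringBuilder sb = new StringBuilder();
        for (StackTraceElement e : new Throwable().getStackTrace()) {
            sb.append("  at ").append(e).append('\n');
        }
        return sb.toString();
    }

    static void decRef(String dbPath, int refCnt) {
        if (traceEnabled) {
            // only now is the stack trace computed and formatted
            System.out.printf("DecRef %s to refCnt %d, stackTrace:%n%s",
                dbPath, refCnt, captureStack());
        }
    }
}
```

With the guard in place, a hot-path call like {{decRef}} pays nothing for the stack capture unless trace logging is enabled.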
[jira] [Commented] (HDDS-1833) RefCountedDB printing of stacktrace should be moved to trace logging
[ https://issues.apache.org/jira/browse/HDDS-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892989#comment-16892989 ] Eric Yang commented on HDDS-1833: - [~swagle] Thank you for patch 002. Unfortunately, this doesn't quite work as I intended. Sorry for the misleading suggestion in option 1. The output may turn out to be: {code} java.lang.Thread.getStackTrace(Thread.java:1559) {code} h3. Option 2 This will give a one-line output of the current stack: {code} LOG.trace("DecRef {} to refCnt {}, stackTrace: {}", containerDBPath, referenceCount.get(), new Throwable().getStackTrace()); {code} Output: {code} DecRef /test to 0, stackTrace: org/apache/hadoop/ozone/container/common/utils/ReferenceCountedDB. decrementReference(ReferenceCountedDB.java:64) {code} h3. Option 3 If you want a full stack trace, the solution is: {code} import org.apache.commons.lang3.exception.ExceptionUtils; ... LOG.trace("DecRef {} to refCnt {}, stackTrace: {}", containerDBPath, referenceCount.get(), ExceptionUtils.getStackTrace(new Throwable())); {code} Option 2 is better for knowing where the call originated. Option 3 is useful for listing the full stack. > RefCountedDB printing of stacktrace should be moved to trace logging > > > Key: HDDS-1833 > URL: https://issues.apache.org/jira/browse/HDDS-1833 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: newbie > Attachments: HDDS-1833.01.patch, HDDS-1833.02.patch > > > RefCountedDB logs the stackTrace for both increment and decrement, this > pollutes the logs.
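The two flavors above can be sketched with the standard library only, assuming commons-lang3 is not on the classpath; {{StackFlavors}} is an illustrative name, not Ozone code:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;

// Stdlib-only sketch of the two stack-rendering flavors discussed above.
public class StackFlavors {
    // Option 2 flavor: a compact, single-string rendering of the frames.
    static String oneLiner() {
        return Arrays.toString(new Throwable().getStackTrace());
    }

    // Option 3 flavor: the full multi-line trace, similar to what
    // ExceptionUtils.getStackTrace(new Throwable()) produces.
    static String fullStack() {
        StringWriter sw = new StringWriter();
        new Throwable().printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }
}
```

The one-liner is cheap to scan in a busy log; the full trace is easier to read when you actually need the whole call chain.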
[jira] [Created] (HDDS-1857) YARN fails on mapreduce in Kerberos enabled cluster
Eric Yang created HDDS-1857: --- Summary: YARN fails on mapreduce in Kerberos enabled cluster Key: HDDS-1857 URL: https://issues.apache.org/jira/browse/HDDS-1857 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Eric Yang When Ozone is configured as a secure cluster, running a mapreduce job on secure YARN produces this error message: {code} 2019-07-23 19:33:12,168 INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.io.IOException: DestHost:destPort eyang-1.openstacklocal:9862 , LocalHost:localPort eyang-1.openstacklocal/172.26.111.17:0. Failed on local exception: java.io.IOException: Couldn't set up IO streams: java.util.ServiceConfigurationError: org.apache.hadoop.security.SecurityInfo: Provider org.apache.hadoop.yarn.server.RMNMSecurityInfoClass not a subtype, while invoking $Proxy13.submitRequest over nodeId=null,nodeAddress=eyang-1.openstacklocal:9862 after 9 failover attempts. Trying to failover immediately. 2019-07-23 19:33:12,174 ERROR ha.OMFailoverProxyProvider: Failed to connect to OM. 
Attempted 10 retries and 10 failovers 2019-07-23 19:33:12,176 ERROR client.OzoneClientFactory: Couldn't create protocol class org.apache.hadoop.ozone.client.rpc.RpcClient exception: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ozone.client.OzoneClientFactory.getClientProtocol(OzoneClientFactory.java:291) at org.apache.hadoop.ozone.client.OzoneClientFactory.getRpcClient(OzoneClientFactory.java:169) at org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl.(BasicOzoneClientAdapterImpl.java:137) at org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl.(BasicOzoneClientAdapterImpl.java:101) at org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl.(BasicOzoneClientAdapterImpl.java:86) at org.apache.hadoop.fs.ozone.OzoneClientAdapterImpl.(OzoneClientAdapterImpl.java:34) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.fs.ozone.OzoneClientAdapterFactory.lambda$createAdapter$1(OzoneClientAdapterFactory.java:66) at org.apache.hadoop.fs.ozone.OzoneClientAdapterFactory.createAdapter(OzoneClientAdapterFactory.java:116) at org.apache.hadoop.fs.ozone.OzoneClientAdapterFactory.createAdapter(OzoneClientAdapterFactory.java:62) at org.apache.hadoop.fs.ozone.OzoneFileSystem.createAdapter(OzoneFileSystem.java:98) at org.apache.hadoop.fs.ozone.BasicOzoneFileSystem.initialize(BasicOzoneFileSystem.java:144) at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3338) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:136) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3387) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3355) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:497) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:245) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:481) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365) at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:352) at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:250) at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:233) at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:104) at org.apache.hadoop.fs.shell.Command.run(Command.java:177) at org.apache.hadoop.fs.FsShell.run(FsShell.java:327) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) at org.apache.hadoop.fs.FsShell.main(FsShell.java:390) Caused by: java.io.IOException: DestHost:destPort eyang-1.openstacklocal:9862 , LocalHost:localPort eyang-1.openstacklocal/172.26.111.17:0. Failed on local exception: java.io.IOException: Couldn't set up IO streams: java.util.ServiceConfigurationError: org.apache.hadoop.security.SecurityInfo: Provider org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.security.LocalizerSecurityInfo not a subtype at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at
[jira] [Commented] (HDDS-1094) Performance testing infrastructure : Special handling for zero-filled chunks on the Datanode
[ https://issues.apache.org/jira/browse/HDDS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892115#comment-16892115 ] Eric Yang commented on HDDS-1094: - HDDS-1772 implements some tests that fill up the datanode disk, which might be useful here. > Performance testing infrastructure : Special handling for zero-filled chunks > on the Datanode > > > Key: HDDS-1094 > URL: https://issues.apache.org/jira/browse/HDDS-1094 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Datanode >Reporter: Supratim Deka >Priority: Major > > Goal: > Make Ozone chunk Read/Write operations CPU/network bound for specially > constructed performance micro benchmarks. > Remove disk bandwidth and latency constraints - running ozone data path > against extreme low-latency & high throughput storage will expose performance > bottlenecks in the flow. But low-latency storage (NVMe flash drives, Storage > class memory etc) is expensive and availability is limited. Is there a > workaround which achieves similar running conditions for the software without > actually having the low latency storage? At least for specially constructed > datasets - for example zero-filled blocks (*not* zero-length blocks). > Required characteristics of the solution: > No changes in Ozone client, OM and SCM. Changes limited to Datanode, Minimal > footprint in datanode code. > Possible High level Approach: > The ChunkManager and ChunkUtils can enable writeChunk for zero-filled chunks > to be dropped without actually writing to the local filesystem. Similarly, > readChunk can construct a zero-filled buffer without reading from the local > filesystem whenever it detects a zero-filled chunk. Specifics of how to > detect and record a zero-filled chunk can be discussed on this jira. Also > discuss how to control this behaviour and make it available only for internal > testing. 
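As a starting point for the zero-filled-chunk detection discussed in the issue description above, the check itself can be a straightforward scan; {{ZeroChunk}} and {{isZeroFilled}} are illustrative names, not Ozone code:

```java
import java.nio.ByteBuffer;

// Illustrative helper, not Ozone code: returns true when every byte between
// the buffer's position and limit is zero, so writeChunk could skip the
// filesystem write and readChunk could synthesize the buffer.
public class ZeroChunk {
    static boolean isZeroFilled(ByteBuffer chunk) {
        for (int i = chunk.position(); i < chunk.limit(); i++) {
            if (chunk.get(i) != 0) {
                return false;
            }
        }
        return true;
    }
}
```

A linear scan is O(n) per chunk, so whether this stays cheap enough for the write path — or needs recording at write time, as the description suggests — is exactly the trade-off to discuss on the jira.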
[jira] [Comment Edited] (HDDS-1833) RefCountedDB printing of stacktrace should be moved to trace logging
[ https://issues.apache.org/jira/browse/HDDS-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891993#comment-16891993 ] Eric Yang edited comment on HDDS-1833 at 7/24/19 5:16 PM: -- We usually don't need a conditional check unless the log statement contains expensive computation before it can be printed. In this case, we are simply concatenating two strings. Using a LOG.trace statement only and letting the logger suppress the output may be good enough unless we want to include the stacktrace. new Exception().printStackTrace(); will print the stacktrace to the .out file. This might result in the same related log statement being sent to both the .out and .log files. This could be hard to correlate. In general Hadoop does not log anything to the .out file other than showing the current ulimit configuration. I would suggest using the slf4j feature of rendering the stacktrace in the same log statement. {code} if (LOG.isTraceEnabled()) { LOG.trace("DecRef {} to refCnt {}, stacktrace: {}", containerDBPath, referenceCount.get(), Thread.currentThread().getStackTrace()); } {code} was (Author: eyang): We usually don't need a conditional check unless the log statement contains expensive computation before it can be printed. In this case, we are simply concatenating two strings. Using a LOG.trace statement only and letting the logger suppress the output is good enough. new Exception().printStackTrace(); will print the stacktrace to the .out file. This means we would have duplicated output in the logs as well as stdout, which is converted to the .out log file. In general Hadoop does not log anything to the .out file other than showing the current ulimit configuration. I would suggest skipping the new Exception().printStackTrace() statement altogether. 
> RefCountedDB printing of stacktrace should be moved to trace logging > > > Key: HDDS-1833 > URL: https://issues.apache.org/jira/browse/HDDS-1833 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: newbie > Attachments: HDDS-1833.01.patch > > > RefCountedDB logs the stackTrace for both increment and decrement, this > pollutes the logs.
[jira] [Commented] (HDDS-1833) RefCountedDB printing of stacktrace should be moved to trace logging
[ https://issues.apache.org/jira/browse/HDDS-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891993#comment-16891993 ] Eric Yang commented on HDDS-1833: - We usually don't need a conditional check unless the log statement contains expensive computation before it can be printed. In this case, we are simply concatenating two strings. Using a LOG.trace statement only and letting the logger suppress the output is good enough. new Exception().printStackTrace(); will print the stacktrace to the .out file. This means we would have duplicated output in the logs as well as stdout, which is converted to the .out log file. In general Hadoop does not log anything to the .out file other than showing the current ulimit configuration. I would suggest skipping the new Exception().printStackTrace() statement altogether. > RefCountedDB printing of stacktrace should be moved to trace logging > > > Key: HDDS-1833 > URL: https://issues.apache.org/jira/browse/HDDS-1833 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Datanode >Affects Versions: 0.4.0 >Reporter: Mukul Kumar Singh >Assignee: Siddharth Wagle >Priority: Major > Labels: newbie > Attachments: HDDS-1833.01.patch > > > RefCountedDB logs the stackTrace for both increment and decrement, this > pollutes the logs.
[jira] [Created] (HDDS-1847) Datanode Kerberos principal and keytab config key looks inconsistent
Eric Yang created HDDS-1847: --- Summary: Datanode Kerberos principal and keytab config key looks inconsistent Key: HDDS-1847 URL: https://issues.apache.org/jira/browse/HDDS-1847 Project: Hadoop Distributed Data Store Issue Type: Bug Affects Versions: 0.5.0 Reporter: Eric Yang Ozone Kerberos configuration can be very confusing: | config name | Description | | hdds.scm.kerberos.principal | SCM service principal | | hdds.scm.kerberos.keytab.file | SCM service keytab file | | ozone.om.kerberos.principal | Ozone Manager service principal | | ozone.om.kerberos.keytab.file | Ozone Manager keytab file | | hdds.scm.http.kerberos.principal | SCM service spnego principal | | hdds.scm.http.kerberos.keytab.file | SCM service spnego keytab file | | ozone.om.http.kerberos.principal | Ozone Manager spnego principal | | ozone.om.http.kerberos.keytab.file | Ozone Manager spnego keytab file | | hdds.datanode.http.kerberos.keytab | Datanode spnego keytab file | | hdds.datanode.http.kerberos.principal | Datanode spnego principal | | dfs.datanode.kerberos.principal | Datanode service principal | | dfs.datanode.keytab.file | Datanode service keytab file | The prefixes are very different for each of the datanode configuration keys. It would be nice to have some consistency for the datanode.
[jira] [Comment Edited] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test
[ https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891178#comment-16891178 ] Eric Yang edited comment on HDFS-14461 at 7/23/19 4:14 PM: --- [~hexiaoqiao] {quote}SecurityConfUtil#initSecurity does not set principal or keytab currently. I try to reference to corresponding SPNEGO principal and test.keytab and throws another exception as following,{quote} I think "Authentication required" is caused by the caller not sending the authentication header. {code} conf.set(DFSConfigKeys.DFS_WEBHDFS_AUTHENTICATION_FILTER_KEY, NoAuthFilter.class.getName()); {code} The code above sets dfs.web.authentication.filter to a no-authentication filter. This is what turns off the SPNEGO filter. You should configure it to use either AuthenticationFilter or ProxyUserAuthenticationFilter or AuthFilter to get a proper SPNEGO setup. HADOOP-16314 and HADOOP-16354 are designed to inspect hadoop.http.filter.initializers; if AuthenticationFilter or ProxyUserAuthenticationFilter is set in the config, it will switch to AuthFilter because HDFS uses AuthFilter to issue delegation tokens. You were closer to getting successful authentication when you got Authentication required. The caller side must send a valid SPNEGO negotiation header that looks like this: {code} Authorization: Negotiate [base64 hex string of user tgt] {code} Example code for generating the token for the kerberos authentication negotiate header is available in the hadoop-common TestKerberosAuthenticationHandler#testRequestWithAuthorization test case. Please make sure both server side and client side configuration have Kerberos turned on, otherwise the client may not send the required header for authentication. was (Author: eyang): [~hexiaoqiao] {quote}SecurityConfUtil#initSecurity does not set principal or keytab currently. 
I try to reference to corresponding SPNEGO principal and test.keytab and throws another exception as following,{quote} I think "Authentication required" is caused by the caller did not send the authentication header. {code} conf.set(DFSConfigKeys.DFS_WEBHDFS_AUTHENTICATION_FILTER_KEY, NoAuthFilter.class.getName()); {code} This above code is setting dfs.web.authentication.filter to no authentication filter. This is what turns off SPNEGO filter. You should configure it to use either AuthenticationFilter or ProxyUserAuthenticationFilter or AuthFilter to get proper SPNEGO setup. HADOOP-16314 and HADOOP-16354 are designed to inspect hadoop.http.filter.initializers and if AuthenticationFilter or ProxyUserAuthenticationFilter is set in the config. It will switch to use AuthFilter because HDFS uses AuthFilter to issue delegation token. You were closer to getting successful authentication when you get Authentication required. The caller side must send a valid SPNEGO negotiation header that looks like this: {code} Authorization: Negotiate [base64 hex string of user tgt] {code} Example code for generating the token for kerberos authentication negotiate header is available in hadoop-common TestKerberosAuthenticationHandler#testRequestWithAuthorization test case. Please make sure both server side and client side configuration have Kerberos turned on, otherwise client may not send the required header for authentication. > RBF: Fix intermittently failing kerberos related unit test > -- > > Key: HDFS-14461 > URL: https://issues.apache.org/jira/browse/HDFS-14461 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: He Xiaoqiao >Priority: Major > > TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It > may be due to some race condition before using the keytab that's created for > testing. 
> > {code:java} > Failed > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken > Failing for the past 1 build (Since > [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! > #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] ) > [Took 89 > ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history] > > Error Message > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED > h3. Stacktrace > org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.security.KerberosAuthException: failure to
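For reference, a proper SPNEGO setup of the kind described in the comment above would configure the standard authentication filter initializer instead of NoAuthFilter. The following is a hedged sketch of the relevant core-site.xml entries; the principal, keytab path, and realm are placeholders, not values from this issue:

```xml
<!-- Sketch of core-site.xml entries enabling SPNEGO on the HTTP endpoints.
     Values are placeholders. HADOOP-16314/HADOOP-16354 inspect
     hadoop.http.filter.initializers to decide whether to swap in AuthFilter. -->
<property>
  <name>hadoop.http.filter.initializers</name>
  <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
```

With this in place, the server answers an unauthenticated request with 401 and `WWW-Authenticate: Negotiate`, and a Kerberos-enabled client replies with the `Authorization: Negotiate` header described above.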
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889399#comment-16889399 ] Eric Yang commented on HDDS-1712: - [~anu] {quote}Case in point when you told me that Ozone is full of findbugs issues and checkstyle issues. When I asked you to compare with Hadoop you ran away, because like this it was blatantly false.{quote} With regard to findbugs issues, Hadoop does not require the Findbugs jar file on the classpath at runtime. Most of Hadoop's findbugs exclusions deal with object serialization generated by protobuf codegen. Those bugs are flagged manually because of codegen and unfortunate compatibility constraints around FSImage mutations, and exclusions are only used as a last resort. Ozone uses annotations to suppress findbugs rather quickly, and the suppressed bugs are not at the same hard-to-solve level as those in Hadoop. The usage is very different. Why is having Findbugs on the classpath not good? Findbugs depends on an older XML parser, which has CVE vulnerabilities. If we don't need the jar file on the classpath, please remove it from the runtime. It is hard to identify how people would misuse vulnerabilities when a collection of them is hidden in the software. Due diligence would help to keep security bugs down. I offered the patches, and Marton said it's good to fix them. Whether you accept or reject the patches is your choice. If you allow sudo in the container, you will only end up with more code that does remote root download and execution at runtime. This makes Ozone more unpredictable and dangerous, and it will be hard to clean up later.
> Remove sudo access from Ozone docker image > -- > > Key: HDDS-1712 > URL: https://issues.apache.org/jira/browse/HDDS-1712 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1712.001.hadoop-docker-ozone.patch, > HDDS-1712.001.patch, HDDS-1712.002.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Ozone docker image is given unlimited sudo access to hadoop user. This poses > a security risk where host level user uid 1000 can attach a debugger to the > container process to obtain root access. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
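The shape of the fix being argued for can be illustrated with a minimal Dockerfile sketch. This is an assumption about the approach, not the actual ozone-runner Dockerfile or either attached patch: the runtime user is created without installing sudo or adding any sudoers entry, so everything in the container runs strictly unprivileged.

```dockerfile
FROM centos:7
# Create the unprivileged runtime user. No sudo package is installed and no
# /etc/sudoers.d entry is added, so a host user with uid 1000 who attaches a
# debugger to the container process still only holds an unprivileged identity.
RUN useradd --uid 1000 --create-home hadoop
# Drop privileges for everything that follows and for the running container.
USER hadoop
WORKDIR /opt/hadoop
```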
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889280#comment-16889280 ] Eric Yang commented on HDDS-1712: - {quote}I am -1; on this patch and wasteful discussion. As I have clearly said many times; these are to be treated as examples and documentation, not as part of the product. Unless there is a change in that status, I am not willing to commit this patch.{quote} With all due respect, I cannot agree that this is just examples and documentation. According to the [Alpha cluster|https://hadoop.apache.org/ozone/docs/0.4.0-alpha/runningviadocker.html] documentation, this is the first thing that you ask people to try. Whether you try Ozone from the binary or build from source, the ozone-runner image is used in all paths. Hence, according to the Ozone website, there is no path that avoids the vulnerable docker image. Although it is possible to set up manually from the tarball binary without running the smoke test, that path is not documented in any known material. Hence, this vulnerable docker image puts everyone who tries Ozone at risk. [Security is mandatory|https://www.apache.org/foundation/how-it-works.html#philosophy] is one of Apache's guiding principles. Please be considerate of others: at a minimum, fully document the tarball instructions to avoid the mistake, or simply polish the code to a more presentable state before release. > Remove sudo access from Ozone docker image > -- > > Key: HDDS-1712 > URL: https://issues.apache.org/jira/browse/HDDS-1712 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1712.001.hadoop-docker-ozone.patch, > HDDS-1712.001.patch, HDDS-1712.002.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Ozone docker image is given unlimited sudo access to hadoop user.
This poses > a security risk where host level user uid 1000 can attach a debugger to the > container process to obtain root access. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1712: Status: Patch Available (was: Reopened)
[jira] [Reopened] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reopened HDDS-1712: -
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889252#comment-16889252 ] Eric Yang commented on HDDS-1712: - [~elek] HDDS-1712.001.hadoop-docker-ozone.patch and HDDS-1712.002.patch should together remove sudo to make the ozone-runner image less powerful. I can only get 33 out of 110 test cases to pass on my own test machine without the patch. When the patch is applied, the same result appears in the smoke test report. I don't have an s3 account to validate whether the s3 test cases would pass. Please help with the verification. Thanks
[jira] [Updated] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1712: Attachment: HDDS-1712.002.patch
[jira] [Updated] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1712: Attachment: HDDS-1712.001.hadoop-docker-ozone.patch
[jira] [Commented] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889200#comment-16889200 ] Eric Yang commented on HDDS-1773: - Patch 005 fixes some minor directory configuration errors and the core-site.xml setup. > Add intermittent IO disk test to fault injection test > - > > Key: HDDS-1773 > URL: https://issues.apache.org/jira/browse/HDDS-1773 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1773.001.patch, HDDS-1773.002.patch, > HDDS-1773.003.patch, HDDS-1773.004.patch, HDDS-1773.005.patch > > > Disk errors can also be simulated by setting the cgroup blkio rate to 0 while the > Ozone cluster is running. > This test will be added to the corruption test project, and it will only be > performed if there is write access to the host cgroup to control the throttle > of disk IO. > Expected result: > When a datanode becomes unresponsive due to slow IO, SCM must flag the node as > unhealthy.
[jira] [Updated] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1773: Attachment: HDDS-1773.005.patch
[jira] [Updated] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1773: Attachment: HDDS-1773.004.patch
[jira] [Assigned] (HDDS-1828) smoke test core-site.xml is confusing to user
[ https://issues.apache.org/jira/browse/HDDS-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reassigned HDDS-1828: --- Assignee: Xiaoyu Yao > smoke test core-site.xml is confusing to user > - > > Key: HDDS-1828 > URL: https://issues.apache.org/jira/browse/HDDS-1828 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: Xiaoyu Yao >Priority: Major > > In the smoke test, Hadoop code is placed in /opt/hadoop, and Ozone code is placed > in /opt/ozone. > There are two copies of core-site.xml, one in $HADOOP_CONF_DIR and another > in $OZONE_CONF_DIR. When a user looks at the copy in $OZONE_CONF_DIR, > core-site.xml is empty. This may lead to the assumption that Hadoop is running > with the local file system. Most applications reference core-site.xml on > the classpath, so it depends on the application being carefully set up > to avoid using $OZONE_CONF_DIR as a single-node Hadoop configuration. It may make sense to > symlink $OZONE_CONF_DIR to $HADOOP_CONF_DIR to prevent mistakes.
[jira] [Created] (HDDS-1828) smoke test core-site.xml is confusing to user
Eric Yang created HDDS-1828: --- Summary: smoke test core-site.xml is confusing to user Key: HDDS-1828 URL: https://issues.apache.org/jira/browse/HDDS-1828 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Eric Yang In the smoke test, Hadoop code is placed in /opt/hadoop, and Ozone code is placed in /opt/ozone. There are two copies of core-site.xml, one in $HADOOP_CONF_DIR and another in $OZONE_CONF_DIR. When a user looks at the copy in $OZONE_CONF_DIR, core-site.xml is empty. This may lead to the assumption that Hadoop is running with the local file system. Most applications reference core-site.xml on the classpath, so it depends on the application being carefully set up to avoid using $OZONE_CONF_DIR as a single-node Hadoop configuration. It may make sense to symlink $OZONE_CONF_DIR to $HADOOP_CONF_DIR to prevent mistakes.
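The symlink idea in the last sentence can be sketched as follows. The paths under /tmp are illustrative stand-ins for the real config directories under /opt/hadoop and /opt/ozone, so the sketch runs without the images:

```shell
# Stand-in paths; in the real images these would be the Hadoop and Ozone
# config directories (e.g. under /opt/hadoop and /opt/ozone).
HADOOP_CONF_DIR=/tmp/demo-hadoop-conf
OZONE_CONF_DIR=/tmp/demo-ozone-conf

mkdir -p "$HADOOP_CONF_DIR"
printf '<configuration>\n</configuration>\n' > "$HADOOP_CONF_DIR/core-site.xml"

# Replace the separate (possibly empty) Ozone conf dir with a symlink, so any
# process reading $OZONE_CONF_DIR/core-site.xml sees the same file Hadoop uses.
rm -rf "$OZONE_CONF_DIR"
ln -s "$HADOOP_CONF_DIR" "$OZONE_CONF_DIR"
```

With the symlink in place, there is only one core-site.xml on disk, so an application putting $OZONE_CONF_DIR on its classpath can no longer pick up an empty copy.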
[jira] [Created] (HDDS-1826) External Ozone client throws exception when accessing data in docker container
Eric Yang created HDDS-1826: --- Summary: External Ozone client throws exception when accessing data in docker container Key: HDDS-1826 URL: https://issues.apache.org/jira/browse/HDDS-1826 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Eric Yang External Ozone client has trouble accessing Ozone data hosted in docker container when data replication is set to 3. This RPC error message is thrown: {code} Error while calling command (org.apache.hadoop.ozone.web.ozShell.volume.CreateVolumeHandler@2903c6ff): org.apache.hadoop.ipc.RemoteException(com.google.protobuf.InvalidProtocolBufferException): While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either than the input has been truncated or that an embedded message misreported its own length. at com.google.protobuf.InvalidProtocolBufferException.truncatedMessage(InvalidProtocolBufferException.java:70) at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:728) at com.google.protobuf.CodedInputStream.readRawByte(CodedInputStream.java:769) at com.google.protobuf.CodedInputStream.readRawVarint32(CodedInputStream.java:378) at com.google.protobuf.CodedInputStream.readEnum(CodedInputStream.java:343) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneAclInfo.(OzoneManagerProtocolProtos.java:42318) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneAclInfo.(OzoneManagerProtocolProtos.java:42236) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneAclInfo$1.parsePartialFrom(OzoneManagerProtocolProtos.java:42366) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneAclInfo$1.parsePartialFrom(OzoneManagerProtocolProtos.java:42361) at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$VolumeInfo.(OzoneManagerProtocolProtos.java:21457) at 
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$VolumeInfo.(OzoneManagerProtocolProtos.java:21376) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$VolumeInfo$1.parsePartialFrom(OzoneManagerProtocolProtos.java:21501) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$VolumeInfo$1.parsePartialFrom(OzoneManagerProtocolProtos.java:21496) at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$CreateVolumeRequest.(OzoneManagerProtocolProtos.java:23836) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$CreateVolumeRequest.(OzoneManagerProtocolProtos.java:23783) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$CreateVolumeRequest$1.parsePartialFrom(OzoneManagerProtocolProtos.java:23887) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$CreateVolumeRequest$1.parsePartialFrom(OzoneManagerProtocolProtos.java:23882) at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OMRequest.(OzoneManagerProtocolProtos.java:2023) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OMRequest.(OzoneManagerProtocolProtos.java:1935) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OMRequest$1.parsePartialFrom(OzoneManagerProtocolProtos.java:2607) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OMRequest$1.parsePartialFrom(OzoneManagerProtocolProtos.java:2602) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49) at org.apache.hadoop.ipc.RpcWritable$ProtobufWrapper.readFrom(RpcWritable.java:125) at org.apache.hadoop.ipc.RpcWritable$Buffer.getValue(RpcWritable.java:187) at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:514) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.base/java.security.AccessController.doPrivileged(Native Method) at java.base/javax.security.auth.Subject.doAs(Subject.java:423) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) at
[jira] [Commented] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888291#comment-16888291 ] Eric Yang commented on HDDS-1773: - [~elek] Patch 003 adds a throttle-acid.sh script that periodically throttles the read IO of the datanode container: throttling for 1 second, then unthrottling for 10 seconds. The README file contains instructions on how to use this script as a privileged user. Does this work for you?
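The throttle/unthrottle cycle can be sketched roughly as below. The device number, cgroup path, and helper name are assumptions (the real script is in the attached patch). Note that with the cgroup v1 blkio controller, writing a rate of 0 to blkio.throttle.read_bps_device removes the limit, so a near-zero rate such as 1 byte/s is what actually stalls reads; DRY_RUN=1 makes the sketch runnable without root:

```shell
DRY_RUN=1                 # set to 0 on a real host with root and cgroup v1
DEV="8:0"                 # major:minor of the disk backing the datanode volume
CG="/sys/fs/cgroup/blkio/docker/CONTAINER_ID"   # hypothetical cgroup path

# Apply a read-bandwidth limit for $DEV: 1 byte/s effectively freezes reads,
# while 0 removes the throttle entirely.
set_limit() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would write '$DEV $1' to $CG/blkio.throttle.read_bps_device"
  else
    echo "$DEV $1" > "$CG/blkio.throttle.read_bps_device"
  fi
}

set_limit 1    # throttle reads for one second...
sleep 1
set_limit 0    # ...then lift the limit (the real script repeats this, sleeping 10s)
```

On a real host this would be driven in a loop while SCM's node-state handling is observed, matching the 1s-throttle/10s-unthrottle cadence described in the comment.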
[jira] [Updated] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1773: Attachment: HDDS-1773.003.patch
[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test
[ https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888099#comment-16888099 ] Eric Yang commented on HDFS-14461: -- [~hexiaoqiao] welcome. > RBF: Fix intermittently failing kerberos related unit test > -- > > Key: HDFS-14461 > URL: https://issues.apache.org/jira/browse/HDFS-14461 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: CR Hota >Assignee: Fengnan Li >Priority: Major > > TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It > may be due to some race condition before using the keytab that's created for > testing. > > {code:java} > Failed > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken > Failing for the past 1 build (Since > [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! > #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] ) > [Took 89 > ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history] > > Error Message > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED > h3. 
Stacktrace > org.apache.hadoop.service.ServiceStateException: > org.apache.hadoop.security.KerberosAuthException: failure to login: for > principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) > at > org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at > 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > Caused by: org.apache.hadoop.security.KerberosAuthException: failure to > login: for principal: router/localh...@example.com from keytab > /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab > javax.security.auth.login.LoginException: Integrity check on decrypted field > failed (31) - PREAUTH_FAILED at >
[jira] [Comment Edited] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888095#comment-16888095 ] Eric Yang edited comment on HDDS-1712 at 7/18/19 3:56 PM: -- [~elek] Your output seems to indicate multiple datanode pods. This looks different from what I would expect; shouldn't it look like this: {code} $ kubectl get pod NAME READY STATUSRESTARTS AGE datanode-0 3/3 Running 0 11m om-0 1/1 Running 0 11m s3g-01/1 Running 0 11m scm-01/1 Running 0 11m {code} where the datanode-0 pod has 3 instances running? We can take this offline in HDDS-1825. However, I think it is not fair to ask the sudo-removal patch to include fully working smoke test code on a kubernetes cluster, because the kubernetes cluster code is incomplete. I can include a patch for the smoke test to work with a docker-compose cluster, if you are open to this. Thoughts? > Remove sudo access from Ozone docker image > -- > > Key: HDDS-1712 > URL: https://issues.apache.org/jira/browse/HDDS-1712 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1712.001.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Ozone docker image is given unlimited sudo access to hadoop user. 
This poses > a security risk where host level user uid 1000 can attach a debugger to the > container process to obtain root access.
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888095#comment-16888095 ] Eric Yang commented on HDDS-1712: - [~elek] Your output seems to indicate multiple datanode pods. This looks different from what I would expect; shouldn't it look like this: {code} $ kubectl get pod NAME READY STATUSRESTARTS AGE datanode-0 3/3 Running 0 11m om-0 1/1 Running 0 11m s3g-01/1 Running 0 11m scm-01/1 Running 0 11m {code} where the datanode-0 pod has 3 instances running? We can take this offline in HDDS-1825. However, I think it is not fair to ask the sudo-removal patch to include fully working smoke test code on a kubernetes cluster, because the kubernetes cluster code is incomplete. I can include a patch for the smoke test to work with a docker-compose cluster, if you are open to this. Thoughts? > Remove sudo access from Ozone docker image > -- > > Key: HDDS-1712 > URL: https://issues.apache.org/jira/browse/HDDS-1712 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1712.001.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Ozone docker image is given unlimited sudo access to hadoop user. This poses > a security risk where host level user uid 1000 can attach a debugger to the > container process to obtain root access.
[jira] [Commented] (HDDS-1825) Kubernetes deployment starts only one data node by default
[ https://issues.apache.org/jira/browse/HDDS-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888081#comment-16888081 ] Eric Yang commented on HDDS-1825: - I am not familiar with Kubernetes, but the expected output is supposed to look like this: {code} $ kubectl get pod NAME READY STATUSRESTARTS AGE datanode-0 3/3 Running 0 11m om-0 1/1 Running 0 11m s3g-01/1 Running 0 11m scm-01/1 Running 0 11m {code} > Kubernetes deployment starts only one data node by default > -- > > Key: HDDS-1825 > URL: https://issues.apache.org/jira/browse/HDDS-1825 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > > By following [Ozone > wiki|https://cwiki.apache.org/confluence/display/HADOOP/Deploy+Ozone+to+Kubernetes] > to deploy Ozone on Kubernetes, the default deployment result looks like this: > {code} > $ kubectl get pod > NAME READY STATUSRESTARTS AGE > datanode-0 0/1 Pending 0 11m > om-0 0/1 Pending 0 11m > s3g-00/1 Pending 0 11m > scm-00/1 Pending 0 11m > {code} > There should be three datanodes for Ozone to work.
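If the single pending datanode is simply a replica-count problem, scaling the StatefulSet out is one way to check. The sketch below is illustrative only: the StatefulSet name `datanode` and the label `component=datanode` are taken from the pod spec pasted later in this thread, and the commands assume a working kubectl context against the test cluster.

```shell
# Scale the datanode StatefulSet from 1 replica to the 3 that Ozone
# replication expects (name and label assumed from the pasted pod spec).
kubectl scale statefulset datanode --replicas=3

# Watch the datanode pods come up; all three should eventually be Running.
kubectl get pod -l component=datanode --watch
```

Note that pods stuck in Pending (as in the output above) usually point at scheduling constraints, e.g. the `podAntiAffinity` rule in the pasted spec requiring each datanode on a distinct node.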
[jira] [Created] (HDDS-1825) Kubernetes deployment starts only one data node by default
Eric Yang created HDDS-1825: --- Summary: Kubernetes deployment starts only one data node by default Key: HDDS-1825 URL: https://issues.apache.org/jira/browse/HDDS-1825 Project: Hadoop Distributed Data Store Issue Type: Improvement Reporter: Eric Yang By following [Ozone wiki|https://cwiki.apache.org/confluence/display/HADOOP/Deploy+Ozone+to+Kubernetes] to deploy Ozone on Kubernetes, the default deployment result looks like this: {code} $ kubectl get pod NAME READY STATUSRESTARTS AGE datanode-0 0/1 Pending 0 11m om-0 0/1 Pending 0 11m s3g-00/1 Pending 0 11m scm-00/1 Pending 0 11m {code} There should be three datanodes for Ozone to work.
[jira] [Commented] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887469#comment-16887469 ] Eric Yang commented on HDDS-1773: - [~elek] {quote}I am sorry to say, but I have different opinion (as I tried to explain earlier). Sometimes it's notable, sometimes it's not.{quote} Btw, cgroup can be controlled manually to get greater precision over the number of IOs committed to disk as a group operation. For example: {code} echo ": " > /cgrp/blkio.throttle.io_serviced {code} This throttle configuration can be changed based on time intervals to produce slow and intermittent IO at a fixed interval. Would this work with your train of thought? > Add intermittent IO disk test to fault injection test > - > > Key: HDDS-1773 > URL: https://issues.apache.org/jira/browse/HDDS-1773 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1773.001.patch, HDDS-1773.002.patch > > > Disk errors can also be simulated by setting the cgroup blkio rate to 0 while > the Ozone cluster is running. > This test will be added to the corruption test project, and this test will only be > performed if there is write access to the host cgroup to control the throttle > of disk IO. > Expected result: > When a datanode becomes unresponsive due to slow IO, SCM must flag the node as > unhealthy.
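The interval-based throttling idea above could be sketched against the cgroup v1 blkio controller roughly as follows. This is a hedged sketch, not the patch's implementation: the cgroup mount point, the cgroup name `ozone-fault`, the device numbers `8:0`, and the rates are all assumptions that must be adjusted to the host.

```shell
# Alternate between a severe read throttle and no throttle to simulate
# intermittent slow IO (cgroup v1; 8:0 is assumed to be the data disk).
CG=/sys/fs/cgroup/blkio/ozone-fault
mkdir -p "$CG"
# Datanode PIDs must be added to $CG/tasks for the limits to apply.
for i in 1 2 3; do
  # Throttle reads on the device to ~1 MB/s for 30 seconds.
  echo "8:0 1048576" > "$CG/blkio.throttle.read_bps_device"
  sleep 30
  # Per the kernel blkio docs, writing a rate of 0 removes the rule again.
  echo "8:0 0" > "$CG/blkio.throttle.read_bps_device"
  sleep 30
done
```

Toggling the rule on a timer like this produces the "slow for a while, then normal" pattern at a fixed interval, rather than a uniformly slow disk.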
[jira] [Commented] (HDDS-1771) Add slow IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887407#comment-16887407 ] Eric Yang commented on HDDS-1771: - {quote}Neither HDDS-1773 nor HDDS-1771 test the problems what I described here.{quote} I am sorry, but the problem statement in the test methodology was flawed. One slow IO read or write out of 100 is statistically insignificant and masked by OS and application caches, which I have [explained in HDDS-1773 comment|https://issues.apache.org/jira/browse/HDDS-1773?focusedCommentId=16882206=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16882206]. If you are unable to see past the flaws in the problem statement, then we have reached another impasse. {quote}I can execute it, but what do you expect? What is required for a green build? If the disk speed parameters are low the tests will be failed all time time. If they are high it will be passed.{quote} This test is a good way to find a default for the slowest disk that can be supported by the current code base. These defaults may start failing over time as the Ozone code base becomes more complex. When that happens, we can identify which patch triggered the performance regression, and improve the disk health calculation based on heuristics. I think there is some value in running these tests in nightly runs. > Add slow IO disk test to fault injection test > - > > Key: HDDS-1771 > URL: https://issues.apache.org/jira/browse/HDDS-1771 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1771.001.patch, HDDS-1771.002.patch, > HDDS-1771.003.patch > > > In fault injection test, one possible simulation is to create slow disk IO. > This test can assist in developing a set of timing profiles that works for > Ozone cluster. When we write to a file, the data travels across a bunch of > buffers and caches before it is effectively written to the disk. 
By > controlling cgroup blkio rate in Linux Kernel, we can simulate slow disk > read, write. Docker provides the following parameters to control cgroup: > {code} > --device-read-bps="" > --device-write-bps="" > --device-read-iops="" > --device-write-iops="" > {code} > The test will be added to read/write test with docker compose file as > parameters to test the timing profiles.
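The Docker flags quoted in the issue description take a `<device>:<limit>` value (the placeholders were stripped by the mail archive). A hedged example of launching a throttled container follows; the image tag is the one from the pod spec pasted later in this thread, and `/dev/sda` and the limits are assumptions about the host.

```shell
# Start an Ozone datanode container with its block IO throttled,
# approximating a slow disk (device path and limits are illustrative).
docker run --rm \
  --device-read-bps=/dev/sda:1mb \
  --device-write-bps=/dev/sda:1mb \
  --device-read-iops=/dev/sda:100 \
  --device-write-iops=/dev/sda:100 \
  eyang/ozone:0.5.0-SNAPSHOT ozone datanode
```

In a docker-compose based test, the same limits would be passed through the compose file as the description suggests, so each timing profile is just a different set of compose parameters.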
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887394#comment-16887394 ] Eric Yang commented on HDDS-1712: - {quote}Yes, AFAIK it's fine. If you have any error message, let me know. Happy to help. (But maybe not in this jira, but using the usual channels...) (I am just wondering: If you can't deploy, how do you know how does it work? How do you know if it's wrong...){quote} According to the kubectl output, the pod configuration does not have 3 datanodes: {code} $ kubectl get pod NAME READY STATUSRESTARTS AGE datanode-0 0/1 Pending 0 11m om-0 0/1 Pending 0 11m s3g-00/1 Pending 0 11m scm-00/1 Pending 0 11m {code} ozone.replication is not set to 1; how does this work? The pod configuration in JSON format indicates there are no environment variables for CORE-SITE.XML. How does this work? {code} $ kubectl get pod -o json { "apiVersion": "v1", "items": [ { "apiVersion": "v1", "kind": "Pod", "metadata": { "annotations": { "prometheus.io/path": "/prom", "prometheus.io/port": "9882", "prometheus.io/scrape": "true" }, "creationTimestamp": "2019-07-17T19:30:41Z", "generateName": "datanode-", "labels": { "app": "ozone", "component": "datanode", "controller-revision-hash": "datanode-5f4d6556b8", "statefulset.kubernetes.io/pod-name": "datanode-0" }, "name": "datanode-0", "namespace": "default", "ownerReferences": [ { "apiVersion": "apps/v1", "blockOwnerDeletion": true, "controller": true, "kind": "StatefulSet", "name": "datanode", "uid": "449168e5-c9b9-443c-b65b-475a97e64710" } ], "resourceVersion": "99413", "selfLink": "/api/v1/namespaces/default/pods/datanode-0", "uid": "46f1ad81-312e-4e33-b0a7-8496937511bd" }, "spec": { "affinity": { "podAntiAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": [ { "labelSelector": { "matchExpressions": [ { "key": "component", "operator": "In", "values": [ "datanode" ] } ] }, "topologyKey": "kubernetes.io/hostname" } ] } }, "containers": [ { "args": [ 
"ozone", "datanode" ], "envFrom": [ { "configMapRef": { "name": "config" } } ], "image": "eyang/ozone:0.5.0-SNAPSHOT", "imagePullPolicy": "IfNotPresent", "name": "datanode", "resources": {}, "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "volumeMounts": [ { "mountPath": "/data", "name": "data" }, { "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount", "name": "default-token-phlhw", "readOnly": true } ] } ], "dnsPolicy": "ClusterFirst", "enableServiceLinks": true, "hostname": "datanode-0", "priority": 0, "restartPolicy": "Always", "schedulerName": "default-scheduler",
[jira] [Commented] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887388#comment-16887388 ] Eric Yang commented on HDDS-1773: - {quote} As I wrote earlier: random read/write failure/slowness. Let's say one read request form every 100 requests is significant slower than the others due to a disk error (I would call it intermittent disk slowness) This can't be reproduced with this approach.{quote} In my [previous comment|https://issues.apache.org/jira/browse/HDDS-1773?focusedCommentId=16882206=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16882206], I explained that one slow read for every 100 IOs is not noticeable due to OS and application caching. Unless a constant throttle is applied for a period of time, the JVM will have only a marginal observation of one slow IO. However, if there are bad sectors in the simulated disk, the statistical difference over a period of time can be measured and quantified. This tool can help improve the Ozone code for monitoring disk health. Is this explanation useful for the approach that is taken? > Add intermittent IO disk test to fault injection test > - > > Key: HDDS-1773 > URL: https://issues.apache.org/jira/browse/HDDS-1773 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1773.001.patch, HDDS-1773.002.patch > > > Disk errors can also be simulated by setting the cgroup blkio rate to 0 while > the Ozone cluster is running. > This test will be added to the corruption test project, and this test will only be > performed if there is write access to the host cgroup to control the throttle > of disk IO. > Expected result: > When a datanode becomes unresponsive due to slow IO, SCM must flag the node as > unhealthy.
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887282#comment-16887282 ] Eric Yang commented on HDDS-1712: - {quote}Did you try it out, or is this your expectation?{quote} I tried it out using the [Deploy Ozone to Kubernetes|https://cwiki.apache.org/confluence/display/HADOOP/Deploy+Ozone+to+Kubernetes] instructions on the Ozone wiki. No cluster was deployed successfully. Are these instructions up to date? > Remove sudo access from Ozone docker image > -- > > Key: HDDS-1712 > URL: https://issues.apache.org/jira/browse/HDDS-1712 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1712.001.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Ozone docker image is given unlimited sudo access to hadoop user. This poses > a security risk where host level user uid 1000 can attach a debugger to the > container process to obtain root access.
[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test
[ https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887266#comment-16887266 ] Eric Yang commented on HDFS-14461: -- [~hexiaoqiao] This is not related to HADOOP-16354 or HADOOP-16314. The error message indicates there is a race condition between MiniKdc startup and router.start(). The master key for MiniKdc has not been written yet in the setup() method, while the router is already attempting to start up and log in to Kerberos. If I put a one-second sleep after {code} Configuration conf = SecurityConfUtil.initSecurity(); {code} the router server can start properly. However, the test failed with: {code} [ERROR] testGetDelegationToken(org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken) Time elapsed: 1.551 s <<< ERROR! java.io.IOException: Security enabled but user not authenticated by filter at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:110) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:551) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$800(WebHdfsFileSystem.java:136) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:898) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:864) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:663) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:701) at 
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:697) at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1749) at org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken(TestRouterHttpDelegationToken.java:120) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at
[jira] [Commented] (HDDS-1771) Add slow IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887186#comment-16887186 ] Eric Yang commented on HDDS-1771: - {quote}With an always slow disk the scm can't be started therefore there couldn't be any in-flight connections. {quote} Not true. Even with a slow disk, it is possible to start scm. In the case where disk IO is barely enough, scm can start and write data to disk buffers (application-side cache); it only starts to degrade after some IO operations. In this case, an IOException may be thrown when the scm disk is detected as the bottleneck. {quote}It's not a ready to use test, I can't schedule it to run every night.{quote} Not true, try to create a maven job and run: {code} mvn -f pom.ozone.xml clean verify -Dmaven.javadoc.skip=true -Pit,docker-build,dist {code} This command works when the HDDS-1554 and HDDS-1771 patches are both applied. {quote}I think the real question (at least for me) is that how the intermittent/random read/write failures/slowness are handled, but this approach can't test these questions.{quote} Based on our meeting about not conflating the separate issues of slow disk and intermittent failures, we have a separate ticket, HDDS-1773, for intermittent failure. This was based on your feedback about not conflating separate issues. Do you wish to combine both tickets now, or continue to discuss them separately? > Add slow IO disk test to fault injection test > - > > Key: HDDS-1771 > URL: https://issues.apache.org/jira/browse/HDDS-1771 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1771.001.patch, HDDS-1771.002.patch, > HDDS-1771.003.patch > > > In fault injection test, one possible simulation is to create slow disk IO. > This test can assist in developing a set of timing profiles that works for > Ozone cluster. 
When we write to a file, the data travels across a bunch of > buffers and caches before it is effectively written to the disk. By > controlling cgroup blkio rate in Linux Kernel, we can simulate slow disk > read, write. Docker provides the following parameters to control cgroup: > {code} > --device-read-bps="" > --device-write-bps="" > --device-read-iops="" > --device-write-iops="" > {code} > The test will be added to read/write test with docker compose file as > parameters to test the timing profiles.
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886653#comment-16886653 ] Eric Yang commented on HDDS-1712: - [~elek] core-site.xml is required because fs.defaultName needs to be specified. If there is no core-site.xml with the volume and bucket in the URL, then the test code does not test Ozone. [~anu] Doesn't the Ozone quick start guide refer users to docker-compose to start the cluster? This puts the Docker image on the critical path for most users trying it out. Why ask people to try it out with Docker, if you have no intention to finish what you started? > Remove sudo access from Ozone docker image > -- > > Key: HDDS-1712 > URL: https://issues.apache.org/jira/browse/HDDS-1712 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1712.001.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Ozone docker image is given unlimited sudo access to hadoop user. This poses > a security risk where host level user uid 1000 can attach a debugger to the > container process to obtain root access.
[jira] [Commented] (HDDS-1771) Add slow IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886645#comment-16886645 ] Eric Yang commented on HDDS-1771: - {quote} But can you please defined what is the expected behavior? It's not clear (for me) from the tests. I assume that a good test should have some kind of assertions. What is the assertion here?{quote} The existing ITReadWrite tests are supposed to pass unless the user-defined rate is too slow for normal operations. When this happens, there should be an error message in the logs or UI reporting unhealthy disks/nodes. {quote}What is your expectation in case of a very slow hard disk? To drop client connections? (If I understood well, this is what you mentioned.). To throw an IOException?{quote} An IOException may be thrown on a connection that is in-flight. If a connection has not been established, it may throw connection refused or service unavailable exceptions. The HA logic and disk health detection logic haven't been implemented yet. Those tests can be added later; keep this JIRA as a tuning knob for testing slow disks to find out the minimum IO rate required for normal operation. > Add slow IO disk test to fault injection test > - > > Key: HDDS-1771 > URL: https://issues.apache.org/jira/browse/HDDS-1771 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1771.001.patch, HDDS-1771.002.patch, > HDDS-1771.003.patch > > > In fault injection test, one possible simulation is to create slow disk IO. > This test can assist in developing a set of timing profiles that works for > Ozone cluster. When we write to a file, the data travels across a bunch of > buffers and caches before it is effectively written to the disk. By > controlling cgroup blkio rate in Linux Kernel, we can simulate slow disk > read, write. 
Docker provides the following parameters to control cgroup: > {code} > --device-read-bps="" > --device-write-bps="" > --device-read-iops="" > --device-write-iops="" > {code} > The test will be added to read/write test with docker compose file as > parameters to test the timing profiles.
[jira] [Reopened] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reopened HDDS-1712: - Reopening because security is important. > Remove sudo access from Ozone docker image > -- > > Key: HDDS-1712 > URL: https://issues.apache.org/jira/browse/HDDS-1712 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: pull-request-available > Attachments: HDDS-1712.001.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Ozone docker image is given unlimited sudo access to hadoop user. This poses > a security risk where host level user uid 1000 can attach a debugger to the > container process to obtain root access.
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886364#comment-16886364 ] Eric Yang commented on HDDS-1712: - {quote}2. grep for OZONE-SITE instead of CORE-SITE? The workflow is very similar to the docker-compose clusters, just using a kubernetes configmap instead of env files.{quote} The Kubernetes configmap on trunk looks like this:
{code}
data:
  OZONE-SITE.XML_hdds.datanode.dir: /data/storage
  OZONE-SITE.XML_ozone.scm.datanode.id.dir: /data
  OZONE-SITE.XML_ozone.metadata.dirs: /data/metadata
  OZONE-SITE.XML_ozone.scm.block.client.address: scm-0.scm
  OZONE-SITE.XML_ozone.om.address: om-0.om
  OZONE-SITE.XML_ozone.scm.client.address: scm-0.scm
  OZONE-SITE.XML_ozone.scm.names: scm-0.scm
  OZONE-SITE.XML_ozone.enabled: "true"
  LOG4J.PROPERTIES_log4j.rootLogger: INFO, stdout
  LOG4J.PROPERTIES_log4j.appender.stdout: org.apache.log4j.ConsoleAppender
  LOG4J.PROPERTIES_log4j.appender.stdout.layout: org.apache.log4j.PatternLayout
  LOG4J.PROPERTIES_log4j.appender.stdout.layout.ConversionPattern: '%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n'
{code}
There is no core-site.xml generated. How can the test case be valid? {quote}If I understood well, we can agree that the mentioned statement was not true and the kubernetes examples don't use replication factor 1.{quote} I can agree that a replication factor of 1 does not apply to the k8s tests, but that doesn't change the fact that the current k8s tests use an invalid core-site.xml; hence, the passing test results are questionable.
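The `OZONE-SITE.XML_`-prefixed keys in a configmap like the one above are expanded into config files inside the container at startup. A minimal sketch of that expansion, assuming the behavior of the runner image's env-to-config step (the function name and details here are hypothetical, not the actual script):

```python
# Hypothetical sketch of how OZONE-SITE.XML_-prefixed entries are expanded
# into an ozone-site.xml file; this mirrors the idea behind the runner
# image's env-to-config step, not its actual code.

def render_site_xml(env, prefix="OZONE-SITE.XML_"):
    """Collect keys with the given prefix and emit a Hadoop-style XML config."""
    props = {k[len(prefix):]: v for k, v in env.items() if k.startswith(prefix)}
    lines = ["<configuration>"]
    for name, value in sorted(props.items()):
        lines.append("  <property>")
        lines.append("    <name>%s</name>" % name)
        lines.append("    <value>%s</value>" % value)
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)

# Entries with other prefixes (e.g. LOG4J.PROPERTIES_) belong to other
# files and are ignored by this renderer.
configmap = {
    "OZONE-SITE.XML_ozone.enabled": "true",
    "OZONE-SITE.XML_ozone.scm.names": "scm-0.scm",
    "LOG4J.PROPERTIES_log4j.rootLogger": "INFO, stdout",
}
xml = render_site_xml(configmap)
```

Since no `CORE-SITE.XML_` keys appear in the configmap, no core-site.xml would be produced by such a step, which is the point of the grep above.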
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886360#comment-16886360 ] Eric Yang commented on HDDS-1712: - {quote}It's not enforced. As you can add additional mounts, the uid lines can also be removed.{quote} Not enforcing security is exactly what went wrong in the evolution of the smoke test code. Why are we still arguing against enforcing security?
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886343#comment-16886343 ] Eric Yang commented on HDDS-1712: - {quote}See my comment in the pull request, this is an independent problem. Even without sudo I can do the same (use ubuntu image + mount host path){quote} Please demonstrate. Is that possible if -u ${UID}:${GID} is enforced, the UID does not have sudo access, and host mount paths are permissively allowed? The docker -u flag and mount paths can be audited before source code is committed. By implementing a few simple procedures like these, we can make the Ozone docker image more secure and reduce abuse of root power. We should not give users the false impression that we start with -u hadoop and then go behind their backs to run a sudo curl install. That breaks users' trust in the Ozone-runner image.
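One way the -u mapping could be enforced in a compose-based smoke test is via the service-level {{user}} key; the service name and image below are illustrative assumptions, and UID/GID are taken from the invoking shell's environment:

```yaml
# Illustrative compose fragment pinning the container to a non-root
# runtime user; UID and GID are assumed to be exported by the test driver.
services:
  datanode:
    image: apache/ozone-runner   # illustrative image name
    user: "${UID}:${GID}"
```

Because the mapping lives in the committed compose file rather than in each developer's command line, it can be reviewed and audited before merge, which is the enforcement point argued for above.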
[jira] [Comment Edited] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886332#comment-16886332 ] Eric Yang edited comment on HDDS-1712 at 7/16/19 5:36 PM: -- [~elek] {quote}Definitely not. This patch breaks something which works currently. If some of the mentioned points make it harder to post a proper, fully functional patch, please fix that issue in advance. Thanks a lot.{quote} This is quite disappointing. The two-branch arrangement makes it impossible to provide a fully functional patch upfront. The docker image must be committed and a version produced before the subsequent patch can reference the docker image. It is not possible to provide fully functional patches unless a commit and build tag have been made. In your own code change, you did exactly this in HDDS-1799: you committed pull request 4 without a fully functional pull request 1105. If you allow yourself a lower standard because you are in control of the source code, why ask for a higher standard from others? You should not hold others to a double standard you cannot meet yourself. I will provide a second patch for review, but it will not be the exact code to be committed, because of the two-phase commit issue in the current code structure. Would you be open to a 99% functional patch for the second patch? {quote}I am not sure about kubernetes. Can you please prove this statement (for kubernetes).{quote} I can't find the required core-site.xml values in the k8s examples:
{code}
$ pwd
/home/eyang/test/hadoop/hadoop-ozone/dist/src/main/k8s/examples
[eyang@localhost examples]$ grep -R CORE-SITE *
[eyang@localhost examples]$
{code}
How does the Kubernetes test work if core-site.xml contains no configuration? Please educate me on how the config files are generated for Kubernetes.
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886332#comment-16886332 ] Eric Yang commented on HDDS-1712: - [~elek] {quote}Definitely not. This patch breaks something which works currently. If some of the mentioned points make it harder to post a proper, fully functional patch, please fix that issue in advance. Thanks a lot.{quote} This is quite disappointing. The two-branch arrangement makes it impossible to provide a fully functional patch upfront. The docker image must be committed and a version produced before the subsequent patch can reference the docker image. It is not possible to provide fully functional patches unless a commit and build tag have been made. In your own code change, you did exactly this in HDDS-1799: you committed pull request 4 without a fully functional pull request 1105. If you allow yourself a lower standard because you are in control of the source code, why ask for a higher standard from others? You should not hold others to a double standard you cannot meet yourself. I will provide a second patch for review, but it will not be the exact code to be committed, because of the two-phase commit issue in the current code structure. Would you be open to a 99% functional patch for the second patch? {quote}I am not sure about kubernetes. Can you please prove this statement (for kubernetes).{quote}
{code}
$ pwd
/home/eyang/test/hadoop/hadoop-ozone/dist/src/main/k8s/examples
[eyang@localhost examples]$ grep -R CORE-SITE *
[eyang@localhost examples]$
{code}
How does the Kubernetes test work if core-site.xml contains no configuration? Please educate me on how the config files are generated for Kubernetes.
[jira] [Comment Edited] (HDDS-1771) Add slow IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886298#comment-16886298 ] Eric Yang edited comment on HDDS-1771 at 7/16/19 4:58 PM: -- [~elek] This test helps develop a set of timing profiles for disk IO rates. If the disk is too slow, it would be helpful to detect the IO problems and present an informative error message to the system administrator for troubleshooting. This can save time in problem determination. If disk performance is in a degraded mode, this test can help develop Ozone reliability logic to throttle client connections and save CPU cycles for other tasks, like replicating rocksdb to other disks or blacklisting metadata disks.
[jira] [Commented] (HDDS-1771) Add slow IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886298#comment-16886298 ] Eric Yang commented on HDDS-1771: - [~elek] This test helps develop a set of timing profiles for disk IO rates. If the disk is too slow, it would be helpful to detect the IO problems and present an informative error message to the system administrator for troubleshooting. This can save time in problem determination. If disk performance is in a degraded mode, this test can help develop Ozone reliability logic to throttle client connections and save CPU cycles for other tasks, like replication and blacklisting.
[jira] [Commented] (HDDS-1712) Remove sudo access from Ozone docker image
[ https://issues.apache.org/jira/browse/HDDS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886283#comment-16886283 ] Eric Yang commented on HDDS-1712: - [~elek] Root can jailbreak from the container when mounting host-level files is allowed, such as /etc/passwd, /proc, or /sys/fs. Pull request #1053 demonstrates the danger of giving the hadoop user unrestricted root privileges: by writing a single line to the /etc/passwd file, the hadoop user can install a root user on the host. The hadoop user has the power to create chaos when too many privileges are given; we can remove that risk by giving it non-root access in the container. The hadoop user is given sudo access for binary installation during test runtime, but that package installation can instead happen during the compile or package phase of the maven build cycle. Removing sudo access will force developers to rethink how to instrument tests into the running container more efficiently, without the current smoke test's duplicated downloads of the test framework from the internet. If we can expand on the idea of building the docker image after tarball creation (HDDS-1495), rather than the current runner image layout, forward progress would be easier. I find it difficult to take a reactive approach of removing the sudo requirement and making the current smoke test work with ozone-runner or hadoop-runner, because: # The sudo code is in a separate branch from the smoke test. I cannot make smoke test changes in this ticket because the smoke test logic resides in another branch. # Many binaries are downloaded and installed during the test run. It takes quite a long time to repeatedly install binaries, and on a flaky internet connection the test cases fail more frequently from the inability to install the test framework than from running the tests.
# The current smoke tests and Kubernetes cluster work with a replication factor of 1, and many tests use an empty core-site.xml; hence, the disk operations are not distributed. I find the current smoke tests confusing because the test parameters are invalid. # Configuration changes are needed on demand: maven resource templating allows modifying environment variables prior to test startup. There is a mismatch between the test-generated volume and bucket and the core-site.xml configuration; bucket creation, configuration file generation, and daemon startup happen in no specific order. The current tests mask problems because an empty configuration leads them to use the local disk, which allowed some tests to pass. Properly addressing those problems requires much longer conversations. That is my reasoning for narrowing the scope of this patch to the first step of removing root power. Would you be open to fixing the smoke test in a follow-up ticket?
[jira] [Commented] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884241#comment-16884241 ] Eric Yang commented on HDDS-1773: - Patch 002 provides setup-acid.sh and cleanup-acid.sh to generate a faulty disk. Those scripts require admin privileges to generate a faulty virtual disk. The README file contains step-by-step instructions on how to run the ITAcid test case to exercise Ozone on the faulty disk. > Add intermittent IO disk test to fault injection test > - > > Key: HDDS-1773 > URL: https://issues.apache.org/jira/browse/HDDS-1773 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1773.001.patch, HDDS-1773.002.patch > > > Disk errors can also be simulated by setting the cgroup blkio rate to 0 while the > Ozone cluster is running. > This test will be added to the corruption test project, and it will only be > performed if there is write access to the host cgroup to control the throttling > of disk IO. > Expected result: > When a datanode becomes unresponsive due to slow IO, SCM must flag the node as > unhealthy.
[jira] [Updated] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1773: Attachment: HDDS-1773.002.patch
[jira] [Commented] (HDDS-1773) Add intermittent IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884164#comment-16884164 ] Eric Yang commented on HDDS-1773: - {quote}I agree that it's easy. The problem is that it can't simulate certain types of disk failures.{quote} Can you give an example?
[jira] [Commented] (HDDS-1774) Add disk hang test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884161#comment-16884161 ] Eric Yang commented on HDDS-1774: - Patch 001 is based on HDDS-1772 patch 3. This patch adds a disk hang test that throttles datanode data disk availability and runs the standard upload and download tests. > Add disk hang test to fault injection test > -- > > Key: HDDS-1774 > URL: https://issues.apache.org/jira/browse/HDDS-1774 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1774.001.patch > > > When a disk is corrupted, it may appear to hang while accessing data. > One simulation that can be performed is to set disk IO throughput to > 0 bytes/sec to simulate a disk hang. The Ozone file system client can detect the disk > access timeout and proceed to read/write data on another datanode.
[jira] [Updated] (HDDS-1774) Add disk hang test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1774: Attachment: HDDS-1774.001.patch
[jira] [Commented] (HDDS-1772) Add disk full test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884084#comment-16884084 ] Eric Yang commented on HDDS-1772: - Rebased patch 003 onto HDDS-1771 patch 003. > Add disk full test to fault injection test > -- > > Key: HDDS-1772 > URL: https://issues.apache.org/jira/browse/HDDS-1772 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Eric Yang >Priority: Major > Attachments: HDDS-1772.001.patch, HDDS-1772.002.patch, > HDDS-1772.003.patch > > > In the read-only test, one of the simulations to verify is the data disk becoming > full. This can be tested by using a small Docker data disk to simulate a full disk. > When the data disk is full, Ozone should continue to operate and provide > read access to the Ozone file system.
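The small data disk described above could be sketched in a compose file as a size-limited tmpfs mount; the service name, image, target path, and size are illustrative assumptions, not values from the patch:

```yaml
# Hypothetical compose fragment giving the datanode a tiny data disk
# so that writes fill it quickly and read-only behavior can be observed.
services:
  datanode:
    image: apache/ozone-runner   # illustrative image name
    volumes:
      - type: tmpfs
        target: /data            # assumed datanode data dir
        tmpfs:
          size: 10485760         # 10 MB limit, in bytes
```

Once writes exhaust the 10 MB mount, further writes should fail with ENOSPC while reads continue, which is the condition the read-only test wants to verify.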
[jira] [Updated] (HDDS-1772) Add disk full test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1772: Attachment: HDDS-1772.003.patch
[jira] [Commented] (HDDS-1771) Add slow IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884016#comment-16884016 ] Eric Yang commented on HDDS-1771: - Patch 3 is rebased onto HDDS-1554 patch 13. The rate can be customized by using: {code} mvn clean verify -Ddisk.read.bps=1mb -Ddisk.read.iops=120 -Ddisk.write.bps=300k -Ddisk.write.iops=30 -Pit,docker-build {code} This will exercise the test with: # read rate: 1mb/s, read ops: 120/s # write rate: 300k/s, write ops: 30/s
[jira] [Updated] (HDDS-1771) Add slow IO disk test to fault injection test
[ https://issues.apache.org/jira/browse/HDDS-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated HDDS-1771: Attachment: HDDS-1771.003.patch