[
https://issues.apache.org/jira/browse/TEZ-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ayush Saxena resolved TEZ-4600.
-------------------------------
Resolution: Fixed
> Secret managers in Tez should respect the algorithm set by hadoop
> -----------------------------------------------------------------
>
> Key: TEZ-4600
> URL: https://issues.apache.org/jira/browse/TEZ-4600
> Project: Apache Tez
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 0.10.5
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> After YARN-11738, Hadoop can read the default secret manager algorithm (and key length) from a core-site config:
> https://github.com/apache/hadoop/commit/b9060fc00df89a4c73d5b98947688b200b79901f
> {code}
> static {
>   Configuration conf = new Configuration();
>   String algorithm = conf.get(
>       CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_GENERATOR_ALGORITHM_KEY,
>       CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_GENERATOR_ALGORITHM_DEFAULT);
>   LOG.info("Selected hash algorithm: {}", algorithm);
>   SELECTED_ALGORITHM = algorithm;
>   int length = conf.getInt(
>       CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_LENGTH_KEY,
>       CommonConfigurationKeysPublic.HADOOP_SECURITY_SECRET_MANAGER_KEY_LENGTH_DEFAULT);
>   LOG.info("Selected hash key length:{}", length);
>   SELECTED_LENGTH = length;
> }
> {code}
> With a non-default value, a key mismatch occurs (because Tez uses the
> hardcoded value from TEZ-1596), and Tez breaks in several places:
> 1. dagclient <-> AM communication
> {code}
> Caused by: org.apache.hadoop.ipc.RemoteException: DIGEST-SHA: digest response format violation. Mismatched response.
> {code}
> This is because the ClientToAMTokenSecretManager used in DAGAppMaster
> does not apply the changed, non-default algorithm coming from the DAG payload.
> In the Tez AM, new Configuration() is not suitable, especially at static
> initializer time, because the actual configuration values arrive as a payload
> from the upstream application (such as HiveServer2).
> 2. secure shuffle: the key is handled by the JobTokenSecretManager,
> so if the algorithm used by the fetchers differs from the one in ShuffleHandler,
> the shuffle fetchers fail.
> On the fetcher side:
> {code}
> 2025-02-25 08:36:51,482 [WARN] [Fetcher_B {Map_1 -> Reducer_2} #0] |shuffle.Fetcher|: Fetch Failure while connecting from ccycloud-2.lbodor-fips.root.comops.site to: ccycloud-2.lbodor-fips.root.comops.site:13562, attempt: InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1740472461199_0001_1_00_000000_0_10002, spillType=0, spillId=-1] Informing ShuffleManager:
> java.io.IOException: Server returned HTTP response code: 401 for URL: https://ccycloud-2.lbodor-fips.root.comops.site:13562/mapOutput?job=job_1740472461199_0001&dag=1&reduce=0&map=attempt_1740472461199_0001_1_00_000000_0_10002&keepAlive=true
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
>     at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)
>     at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:260)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnectionInternal(Fetcher.java:565)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:534)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:573)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:492)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:290)
>     at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:78)
>     at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>     at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
>     at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
>     at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> On the ShuffleHandler side (this was a Hadoop ShuffleHandler, by the way):
> {code}
> 2025-02-25 08:31:22,781 WARN org.apache.hadoop.mapred.ShuffleHandler: Shuffle failure
> java.io.IOException: Verification of the hashReply failed
>     at org.apache.hadoop.mapreduce.security.SecureShuffleUtils.verifyReply(SecureShuffleUtils.java:106)
>     at org.apache.hadoop.mapred.ShuffleChannelHandler.verifyRequest(ShuffleChannelHandler.java:470)
>     at org.apache.hadoop.mapred.ShuffleChannelHandler.channelRead0(ShuffleChannelHandler.java:259)
>     at org.apache.hadoop.mapred.ShuffleChannelHandler.channelRead0(ShuffleChannelHandler.java:130)
>     at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
>     at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
>     at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
>     at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> With a fix, both client <-> AM communication and SSL shuffle should work.
> UPDATE: only 2) applies to upstream Tez; 1) was a downstream-only problem.
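
The mismatch described above can be reproduced outside Tez and Hadoop with the plain JCA Mac API. The sketch below is only an illustration, not the actual Tez or Hadoop code: the class HmacMismatchDemo, its sign and verify helpers, the secret, the message, and the choice of HmacSHA1 versus HmacSHA256 are all assumptions made for the example. It shows that a hash produced with one algorithm does not verify when the other side recomputes it with a different algorithm, even though both sides hold the same secret.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical illustration only; not Tez or Hadoop code. */
public class HmacMismatchDemo {

  /** Computes an HMAC over msg with the shared secret, using the given JCA algorithm name. */
  static byte[] sign(String algorithm, byte[] secret, byte[] msg) throws Exception {
    Mac mac = Mac.getInstance(algorithm);
    mac.init(new SecretKeySpec(secret, algorithm));
    return mac.doFinal(msg);
  }

  /** Recomputes the HMAC with the verifier's own algorithm and compares it to the received hash. */
  static boolean verify(String algorithm, byte[] secret, byte[] msg, byte[] hash) throws Exception {
    return MessageDigest.isEqual(sign(algorithm, secret, msg), hash);
  }

  public static void main(String[] args) throws Exception {
    // Both sides share the same secret; only the algorithm differs (example values).
    byte[] secret = "shared-job-token-secret".getBytes(StandardCharsets.UTF_8);
    byte[] msg = "/mapOutput?job=job_1&reduce=0".getBytes(StandardCharsets.UTF_8);

    // One side signs with a fixed algorithm.
    byte[] hash = sign("HmacSHA1", secret, msg);

    // The other side verifies with whatever algorithm it was configured with.
    System.out.println("same algorithm verifies:      " + verify("HmacSHA1", secret, msg, hash));
    System.out.println("different algorithm verifies: " + verify("HmacSHA256", secret, msg, hash));
  }
}
```

In this sketch the algorithm is passed in as an argument rather than fixed in a static initializer, which mirrors the point made in 1): the value has to come from the configuration that actually reaches the process, not from a freshly constructed default.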
--
This message was sent by Atlassian Jira
(v8.20.10#820010)