[ https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015362#comment-17015362 ]
Chandni Singh commented on SPARK-30512:
---------------------------------------

Please assign the issue to me so I can open up a PR.

> Use a dedicated boss event group loop in the netty pipeline for external
> shuffle service
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30512
>                 URL: https://issues.apache.org/jira/browse/SPARK-30512
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.0.0
>            Reporter: Chandni Singh
>            Priority: Major
>
> We have been seeing a large number of SASL authentication (RPC requests)
> timing out with the external shuffle service.
> The issue and all the analysis we did are described here:
> [https://github.com/netty/netty/issues/9890]
> I added a {{LoggingHandler}} to the netty pipeline and realized that even the
> channel registration is delayed by 30 seconds.
> In the Spark External Shuffle service, the boss event group and the worker
> event group are the same, which is causing this delay.
> {code:java}
> EventLoopGroup bossGroup =
>   NettyUtils.createEventLoop(ioMode, conf.serverThreads(),
>     conf.getModuleName() + "-server");
> // The worker group is the very same group as the boss group.
> EventLoopGroup workerGroup = bossGroup;
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
>   .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> When the load at the shuffle service increases, the worker threads are
> busy with existing channels, so registering new channels gets delayed.
> The fix is simple. I created a dedicated boss event loop group with 1
> thread.
> {code:java}
> // Dedicated single-thread boss group for accepting new channels.
> EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
>   conf.getModuleName() + "-boss");
> // The worker group keeps the configured number of server threads.
> EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode,
>   conf.serverThreads(),
>   conf.getModuleName() + "-server");
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
>   .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> This fixed the issue.
> We just need 1 thread in the boss group because there is only a single
> server bootstrap.
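
The queuing effect described in the issue can be illustrated without Netty. The sketch below is only an analogy and not the Spark or Netty code: it uses plain `java.util.concurrent` executors as stand-ins for event loop groups, and the class name `BossGroupSketch`, the method `acceptDelayMillis`, and the thread counts and sleep times are all hypothetical. It measures how long an "accept" task waits before it starts while every worker thread is busy, first when it is submitted to the same pool as the workers and then when it has a dedicated single-thread pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BossGroupSketch {

    /**
     * Occupies every worker thread for busyMillis, then submits an "accept"
     * task to acceptPool and returns how long (in ms) that task waited
     * before it started running.
     */
    static long acceptDelayMillis(ExecutorService acceptPool,
                                  ExecutorService workerPool,
                                  int workerThreads,
                                  long busyMillis) throws Exception {
        CountDownLatch workersStarted = new CountDownLatch(workerThreads);
        for (int i = 0; i < workerThreads; i++) {
            workerPool.submit(() -> {
                workersStarted.countDown();
                try {
                    Thread.sleep(busyMillis); // stand-in for work on existing channels
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        workersStarted.await(); // all worker threads are now busy

        long submittedAt = System.nanoTime();
        Callable<Long> accept = System::nanoTime; // records when "accept" actually runs
        Future<Long> startedAt = acceptPool.submit(accept);
        return TimeUnit.NANOSECONDS.toMillis(startedAt.get() - submittedAt);
    }

    public static void main(String[] args) throws Exception {
        int workerThreads = 4;
        long busyMillis = 300;

        // Shared pool: the accept task queues behind the busy workers.
        ExecutorService shared = Executors.newFixedThreadPool(workerThreads);
        long sharedDelay = acceptDelayMillis(shared, shared, workerThreads, busyMillis);
        shared.shutdown();

        // Dedicated pool: one thread reserved for accepting.
        ExecutorService boss = Executors.newSingleThreadExecutor();
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        long dedicatedDelay = acceptDelayMillis(boss, workers, workerThreads, busyMillis);
        boss.shutdown();
        workers.shutdown();

        System.out.println("accept delay, shared pool (ms):    " + sharedDelay);
        System.out.println("accept delay, dedicated pool (ms): " + dedicatedDelay);
    }
}
```

With the shared pool the accept task should wait roughly as long as the workers stay busy, while the dedicated pool should start it almost immediately. Netty event loops behave differently in detail (a channel is bound to one event loop rather than to a pool-wide queue), so this only shows why reserving a thread for accepting removes the wait.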