My thoughts:

1) If one of the tables involved in the join is relatively small and the plan is not creating a BroadcastHashJoin, force it to create a BHJ by:
   a) an explicit hint
   b) increasing the auto-broadcast threshold property (make sure you do not set it above 4-6 GB; once 8 GB is exceeded you will get an error)
   c) analyzing the tables before running the query to generate stats. It would be good to generate stats at least on the joining columns and on any columns carrying non-trivial filters (i.e. excluding plain null / not-null checks).

2) If the joining columns are not partitioned, try partitioning the tables on the joining columns to take advantage of dynamic partition pruning (DPP).

3) If the plan is already using BHJ, the joining columns cannot be partitioned for some reason, and the join is an inner join, I can help check perf, but that will have to be outside this group, as those changes are not in stock Spark.
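The options in point 1 (and the partitioning idea in point 2) can be sketched in Spark SQL roughly as below. The table names (`orders` as the large fact table, `dim_customer` as the small dimension) and the join column `cust_id` are hypothetical placeholders, not from the original thread:

```sql
-- a) Explicit broadcast hint on the small side of the join:
SELECT /*+ BROADCAST(d) */ o.*, d.segment
FROM orders o
JOIN dim_customer d ON o.cust_id = d.cust_id;

-- b) Raise the auto-broadcast threshold (keep it well under the 8 GB limit):
SET spark.sql.autoBroadcastJoinThreshold = 4g;

-- c) Collect stats so the optimizer can correctly size the tables,
--    including column-level stats on the joining/filter columns:
ANALYZE TABLE dim_customer COMPUTE STATISTICS;
ANALYZE TABLE dim_customer COMPUTE STATISTICS FOR COLUMNS cust_id;

-- 2) One way to get the join column partitioned so DPP can kick in
--    (a sketch; whether this layout helps depends on the key's cardinality):
CREATE TABLE orders_by_cust
USING parquet
PARTITIONED BY (cust_id)
AS SELECT * FROM orders;
```

These are fragments to adapt, not a drop-in script; in particular the partitioning in the last statement only pays off when `cust_id` has a manageable number of distinct values.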
Regards
Asif

On Tue, Sep 9, 2025 at 1:13 PM Jason Jun <[email protected]> wrote:
> Hi there,
>
> We're joining very big datasets in 5-minute buckets in an on-prem k3 env.
>
> We hit this situation very often; I think the shuffle partition is corrupted,
> as 5.1 TB didn't make sense at all.
>
> We're running Spark 3.5.2 on on-prem Kubernetes with the Spark Operator.
>
> So I'd really like to know these, and would really appreciate any clue:
>
> 1. Is this really critical, or can I ignore it?
> 2. What's the root cause? If it's corrupted, why is it happening and how do I fix it?
> 3. Any recommendation for changing the Spark configuration, for instance tuning some network settings or any other tuning factors?
> 4. Can upgrading Spark to the latest version fix this?
>
> Any feedback or comment would be appreciated.
>
> Thx
> Jason
>
> 25/09/09 19:53:28 WARN TransportChannelHandler: Exception in connection from /10.0.41.19:53800
> java.lang.IllegalArgumentException: Too large frame: 5135603447297303916
>     at org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119)
>     at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
>     at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:840)
