My thoughts:

1) If one of the tables involved in the join is relatively small and the plan is not creating a BroadcastHashJoin, force it to create a BHJ by:
   a) an explicit hint
   b) increasing the auto-broadcast threshold property (make sure you do not set it above 4-6 GB; once 8 GB is exceeded you will get an error)
   c) analyzing the tables before running the query to generate stats. It would be good to generate stats at least on the joining columns and on any columns carrying non-trivial filters (i.e. excluding plain null / not-null checks).

2) If the joining columns are not partitioned, try partitioning the tables on the joining columns to take advantage of dynamic partition pruning (DPP).

3) If the plan is already using BHJ, the joining columns cannot be partitioned for some reason, and the join is an inner join, I can help check perf, but that will have to be outside this group, as those changes are not in stock Spark.
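The options in point 1 (and the partitioning idea in point 2) can be sketched in Spark SQL roughly as below. The table names (`orders` as the large fact table, `dim_customer` as the small dimension) and the join column `cust_id` are hypothetical placeholders, not from the original thread:

```sql
-- a) Explicit broadcast hint on the small side of the join:
SELECT /*+ BROADCAST(d) */ o.*, d.segment
FROM orders o
JOIN dim_customer d ON o.cust_id = d.cust_id;

-- b) Raise the auto-broadcast threshold (keep it well under the 8 GB limit):
SET spark.sql.autoBroadcastJoinThreshold = 4g;

-- c) Collect stats so the optimizer can correctly size the tables,
--    including column-level stats on the joining/filter columns:
ANALYZE TABLE dim_customer COMPUTE STATISTICS;
ANALYZE TABLE dim_customer COMPUTE STATISTICS FOR COLUMNS cust_id;

-- 2) One way to get the join column partitioned so DPP can kick in
--    (a sketch; whether this layout helps depends on the key's cardinality):
CREATE TABLE orders_by_cust
USING parquet
PARTITIONED BY (cust_id)
AS SELECT * FROM orders;
```

These are fragments to adapt, not a drop-in script; in particular the partitioning in the last statement only pays off when `cust_id` has a manageable number of distinct values.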
Regards
Asif

On Tue, Sep 9, 2025 at 1:13 PM Jason Jun <[email protected]> wrote:
> Hi there,
>
> We're joining very big datasets in 5-minute buckets in an on-prem k3 env.
>
> We hit this situation very often; I think the shuffle partition is corrupted,
> as 5.1 TB didn't make sense at all.
>
> We're running Spark 3.5.2 on on-prem Kubernetes with the Spark Operator.
>
> So I'd really like to know these, and would really appreciate any clue:
>
> 1. Is this really critical, or can I ignore it?
> 2. What's the root cause? If it's corrupted, why is it happening and how do I fix it?
> 3. Any recommendation for changing the Spark configuration, for instance tuning some network settings or any other tuning factors?
> 4. Can upgrading Spark to the latest version fix this?
>
> Any feedback or comment would be appreciated.
>
> Thx
> Jason
>
> 25/09/09 19:53:28 WARN TransportChannelHandler: Exception in connection from /10.0.41.19:53800
> java.lang.IllegalArgumentException: Too large frame: 5135603447297303916
>     at org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119)
>     at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
>     at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:840)
