[ https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820951#comment-16820951 ]

Mike Chan commented on SPARK-27505:
-----------------------------------

Table desc extended result:

Statistics: 24452111 bytes
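
A minimal PySpark sketch of how that figure can be read back, assuming an
active SparkSession named `spark` and that the table in question is the
default.product table named later in the issue:

    # Pull the detailed table information and keep only the Statistics row.
    stats = spark.sql("DESCRIBE EXTENDED default.product")
    stats.filter("col_name = 'Statistics'").show(truncate=False)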

> autoBroadcastJoinThreshold including bigger table
> -------------------------------------------------
>
>                 Key: SPARK-27505
>                 URL: https://issues.apache.org/jira/browse/SPARK-27505
>             Project: Spark
>          Issue Type: Question
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>            Reporter: Mike Chan
>            Priority: Major
>         Attachments: explain_plan.txt
>
>
> I'm looking at a case where, whenever a certain table is picked up for a 
> broadcast join, the query eventually fails with a remote block error.
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 
> 10485760 bytes.
> [screenshot: spark.sql.autoBroadcastJoinThreshold shown set to 10485760]
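>
> For clarity, a minimal sketch of that setting in PySpark (assuming an 
> active SparkSession named `spark`):
>
>     # 10MB threshold: relations estimated below this size are auto-broadcast.
>     spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)  # 10485760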
>  
> Then we proceeded to run the query. In the SQL plan, we found that one 
> table that is 25MB in size was broadcast as well.
>  
> [screenshot: SQL plan showing the ~25MB table on the broadcast side]
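>
> To double-check which relation is broadcast, a sketch assuming the joined 
> DataFrame is bound to a (hypothetical) name `df`:
>
>     # The broadcast side shows up under BroadcastExchange / BroadcastHashJoin.
>     df.explain()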
>  
> Also, per desc extended, the table is 24452111 bytes. It is a Hive table. We 
> always run into this error when this table is broadcast. Below is a sample 
> error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350)
>   at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>   at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>   at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>   at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> Also attached is the physical plan, if you're interested. One thing to note: 
> if I turn autoBroadcastJoinThreshold down to 5MB, this query executes 
> successfully and default.product is NOT broadcast.
>
> However, when I switch to another query that reads even fewer columns than 
> the previous one, this table still gets broadcast at 5MB and fails with the 
> same error. I even went down to 1MB and got the same result.
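>
> As a stop-gap, auto-broadcast can also be disabled outright by setting the 
> threshold to -1 (the documented "disabled" value), so the planner typically 
> falls back to a sort-merge join; a sketch, assuming the same `spark` session:
>
>     # -1 disables auto-broadcast entirely.
>     spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)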


