Hello,


I have a straightforward use case: joining a large table with a smaller
table. The small table is within the limit I set for
spark.sql.autoBroadcastJoinThreshold.

I notice that ShuffledHashJoin is used to perform the join.
BroadcastHashJoin was used only when I pre-fetched and cached the small
table myself.

I understand that for a typical broadcast join, the small table has to be
read and collect()ed on the driver before it can be broadcast.

Why not do this automatically for joins? That way stage 1 (read large
table) and stage 2 (read small table) can still run in parallel.
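For reference, the workaround I know of (short of caching) is to mark the small side explicitly with the broadcast() hint, which forces BroadcastHashJoin regardless of the planner's size statistics. A minimal sketch, using made-up stand-in tables rather than my actual data, and the current SparkSession API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastHintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-hint-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical stand-ins for the large and small tables.
    val large = spark.range(0, 1000000).toDF("ip")
    val small = spark.range(0, 100).toDF("ip")

    // broadcast() marks the small side for broadcast; the planner then
    // chooses BroadcastHashJoin even if it lacks size statistics for
    // the relation.
    val joined = large.join(broadcast(small), Seq("ip"))
    joined.explain() // plan should show a broadcast join on the small side

    spark.stop()
  }
}
```

But this requires the user to know in advance which side is small, which is exactly what autoBroadcastJoinThreshold was supposed to decide.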





Sort [emailId#19 ASC,date#0 ASC], true
 Exchange (RangePartitioning 24)
  Project [emailId#19,ip#7,date#0,siteName#1,uri#3,status#12,csbytes#16L]
   Filter ((lowerTime#22 <= date#0) && (date#0 <= upperTime#23))
    *ShuffledHashJoin* [ip#7], [ip#18], BuildRight
     Exchange (HashPartitioning 24)
      Project [ip#7,date#0,siteName#1,uri#3,status#12,csbytes#16L]
       PhysicalRDD [date#0,siteName#1,method#2,uri#3,query#4,port#5,userName#6], MapPartitionsRDD[6] at rddToDataFrameHolder at DataSourceReader.scala:25
     Exchange (HashPartitioning 24)
      Project [emailId#19,scalaUDF(date#20) AS upperTime#23,ip#18,scalaUDF(date#20) AS lowerTime#22]
       PhysicalRDD [ip#18,emailId#19,date#20], MapPartitionsRDD[12] at rddToDataFrameHolder at DataSourceReader.scala:41


Srikanth
