
Re: BroadcastHashJoin when RDD is not cached

2015-07-02 Thread Srikanth

Good to know this will be in the next release. Thanks.

On Wed, Jul 1, 2015 at 3:13 PM, Michael Armbrust mich...@databricks.com wrote:

We don't know that the table is small unless you cache it. In Spark 1.5 you'll be able to give us a hint though (

BroadcastHashJoin when RDD is not cached

2015-07-01 Thread Srikanth

Hello, I have a straightforward use case of joining a large table with a smaller table. The small table is within the limit I set for spark.sql.autoBroadcastJoinThreshold. I notice that ShuffledHashJoin is used to perform the join. BroadcastHashJoin was used only when I pre-fetched and cached
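
For readers unfamiliar with the two strategies named above, here is a rough, single-process sketch of how they differ. It is a toy model in plain Python, not Spark code: the function names, the `(key, value)` row shape, and `num_partitions` are all assumptions made for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows):
    """Inner-join (key, value) rows by hashing the small side and probing it."""
    # Build phase: the whole small table becomes one in-memory hash table.
    # On a cluster, this table is what would be copied to every worker.
    table = defaultdict(list)
    for key, small_val in small_rows:
        table[key].append(small_val)
    # Probe phase: the large table is streamed in place, never repartitioned.
    return [
        (key, large_val, small_val)
        for key, large_val in large_rows
        for small_val in table.get(key, [])
    ]


def shuffled_hash_join(large_rows, small_rows, num_partitions=4):
    """Same join, but both sides are first repartitioned by key hash."""
    # Shuffle phase: every row of BOTH tables moves to the bucket for its key.
    buckets = [([], []) for _ in range(num_partitions)]
    for key, val in large_rows:
        buckets[hash(key) % num_partitions][0].append((key, val))
    for key, val in small_rows:
        buckets[hash(key) % num_partitions][1].append((key, val))
    # Each bucket is then joined locally with an ordinary hash join.
    joined = []
    for large_part, small_part in buckets:
        joined.extend(broadcast_hash_join(large_part, small_part))
    return joined


large = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
small = [(1, "x"), (2, "y")]
print(broadcast_hash_join(large, small))
```

Both functions return the same rows (possibly in a different order). The practical difference is that the shuffled variant moves the large table as well, which is why the broadcast variant is usually preferred when one side is small.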

Re: BroadcastHashJoin when RDD is not cached

2015-07-01 Thread Michael Armbrust

We don't know that the table is small unless you cache it. In Spark 1.5 you'll be able to give us a hint though ( https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L581 ) On Wed, Jul 1, 2015 at 8:30 AM, Srikanth srikanth...@gmail.com wrote:
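
The reasoning in this reply can be sketched as a small decision function. This is only a toy model of the behavior described in the thread, not Spark's planner: `choose_join`, its parameters, and the threshold value are hypothetical, and a size of `None` stands for "the table is not cached, so its size is unknown".

```python
from typing import Optional

# Hypothetical stand-in for spark.sql.autoBroadcastJoinThreshold, in bytes.
# The value is arbitrary and chosen only for this illustration.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024


def choose_join(
    small_side_bytes: Optional[int],
    hinted: bool = False,
    threshold: int = AUTO_BROADCAST_JOIN_THRESHOLD,
) -> str:
    """Pick a join strategy from what is known about the smaller table."""
    # An explicit hint (as promised for Spark 1.5) settles the question.
    if hinted:
        return "BroadcastHashJoin"
    # A cached table has a known size that can be compared to the threshold.
    if small_side_bytes is not None and small_side_bytes <= threshold:
        return "BroadcastHashJoin"
    # Unknown size (not cached) or too large: fall back to shuffling.
    return "ShuffledHashJoin"


print(choose_join(None))               # not cached, size unknown
print(choose_join(2 * 1024 * 1024))    # cached and under the threshold
print(choose_join(None, hinted=True))  # not cached, but hinted
```

Under these assumptions the three calls print `ShuffledHashJoin`, `BroadcastHashJoin`, and `BroadcastHashJoin`, matching the behavior reported in the original question and the workaround described in the reply.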