I'm using CDH 5.1.0 with Spark 1.0.0. There is a spark-sql-1.0.0 artifact in
Cloudera's Maven repository. After putting it on the classpath, I can use
Spark SQL in my application.

One issue is that I can't get the join planned as a hash join. It gives a
Cartesian product when I join two SchemaRDDs as follows:

scala> val event = sqlContext.parquetFile("/events/2014-09-28")
         .select('MediaEventID)
         .join(log, joinType = LeftOuter,
               on = Some("event.eventid".attr === "log.eventid".attr))

== Query Plan ==
BroadcastNestedLoopJoin LeftOuter, Some(('event.eventid = 'log.eventid))
 ParquetTableScan [eventid#130L], (ParquetRelation /events/2014-09-28), None
 ParquetTableScan [eventid#125L,listid#126L,isfavorite#127], (ParquetRelation /logs/eventdt=2014-09-28), None

Whenever I join with another SchemaRDD, I get a Cartesian product. Is it
possible to get the join planned as a hash join in Spark 1.0.0?
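
In case it is useful, here is roughly what I have been trying through the SQL
interface instead of the join DSL. This is only a sketch: the table names,
column names, and the assumption that an equi-join written in SQL would be
planned differently are mine, and I am not sure it actually avoids the
nested-loop join in 1.0.0.

// Sketch only: register both SchemaRDDs as tables and write the
// equi-join in SQL, so that event.eventid and log.eventid resolve
// against concrete relations. Table/column names are from my data;
// sc is the shell's SparkContext.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val event = sqlContext.parquetFile("/events/2014-09-28")
val log   = sqlContext.parquetFile("/logs/eventdt=2014-09-28")

event.registerAsTable("event")
log.registerAsTable("log")

val joined = sqlContext.sql("""
  SELECT event.eventid, log.listid, log.isfavorite
  FROM event LEFT OUTER JOIN log ON event.eventid = log.eventid
""")

// Inspect the physical plan to see whether it is still a nested-loop join.
println(joined.queryExecution.executedPlan)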
