Hello,

We have a use case where we need to do a Cartesian join, and for some reason we are not able to get it to work with the Dataset API. We have a similar use case implemented and working with RDDs.
We have two datasets:

- one dataset with 2 string columns, say c1 and c2. It is a small dataset with ~1 million records. Both columns are 32-character strings, so it should be less than 500 MB. We broadcast this dataset.
- the other dataset is a little bigger, with ~10 million records.

Below is the code we are using:

    val ds1 = spark.read.format("csv")
      .option("header", "true")
      .load(<s3-location>)
      .select("c1", "c2")
    ds1.count

    val ds2 = spark.read.format("csv")
      .load(<s3-location>)
      .toDF("c11", "c12", "c13", "c14", "c15", "ts")
    ds2.count

    ds2.crossJoin(broadcast(ds1))
      .filter($"c1" <= $"c11" && $"c11" <= $"c2")
      .count

We even tried a regular join:

    ds2.join(broadcast(ds1), $"c1" <= $"c11" && $"c11" <= $"c2")

I can see the broadcast is successful:

    2019-02-14 23:11:55 INFO CodeGenerator:54 - Code generated in 10.469136 ms
    2019-02-14 23:11:55 INFO TorrentBroadcast:54 - Started reading broadcast variable 29
    2019-02-14 23:11:55 INFO TorrentBroadcast:54 - Reading broadcast variable 29 took 6 ms
    2019-02-14 23:11:56 INFO CodeGenerator:54 - Code generated in 11.280087 ms

But then the stage does not progress. What am I doing wrong? Any pointers would be helpful.

Thanks,
Ankur
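Since the join predicate has no equality condition, Spark cannot plan a broadcast hash join here; with the broadcast hint it should fall back to a BroadcastNestedLoopJoin (and to a CartesianProduct without it). One way to see which plan was actually chosen is to print the physical plan before triggering the count. A minimal sketch, reusing the ds1/ds2 definitions above (the <s3-location> placeholders are left as-is):

```scala
// Sketch: inspect the physical plan before running the job.
// Assumes ds1 and ds2 are the Datasets defined above and that
// `spark` is the active SparkSession.
import org.apache.spark.sql.functions.broadcast
import spark.implicits._  // enables the $"col" column syntax

val joined = ds2.join(broadcast(ds1),
  $"c1" <= $"c11" && $"c11" <= $"c2")

// With the broadcast hint and a non-equi predicate, the physical plan
// should contain BroadcastNestedLoopJoin; without the hint, a
// CartesianProduct is likely instead.
joined.explain()
```

One thing worth noting: even when the broadcast itself succeeds, a nested-loop join still compares every ds2 row against all ~1 million broadcast rows, on the order of 10^13 comparisons for ~10 million rows, so very slow stage progress may reflect the cost of the join algorithm rather than a broadcast failure.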