Hi,

After taking a look at the code, I found the problem: Spark uses BroadcastNestedLoopJoin to handle non-equality join conditions. One of my DataFrames (df1) is created from an existing RDD (a LogicalRDD), so Spark estimates its size as defaultSizeInBytes * length. The other DataFrame (df2) is created from a Hive table (about 1 GB). As a result, Spark thinks df1 is larger than df2, even though df1 is actually very small. Since BroadcastNestedLoopJoin broadcasts the side it believes to be smaller, Spark tries to collect df2 (~1 GB) to the driver, which exceeds spark.driver.maxResultSize and causes the error.
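For anyone who hits the same thing, here is a minimal sketch of the workarounds I can think of. The rdd, schema, table and column names below are made up for illustration, the broadcast hint needs Spark 1.5+, and I believe (but haven't verified on every version) that caching lets the planner pick up accurate in-memory statistics:

    from pyspark.sql.functions import broadcast  # broadcast hint, Spark 1.5+

    # df1 is backed by an existing RDD, so the planner estimates its size
    # as defaultSizeInBytes * length, which makes it look huge.
    df1 = sqlContext.createDataFrame(rdd, schema)
    df2 = sqlContext.table("some_hive_table")  # ~1 GB, real statistics

    # Workaround 1: cache and materialize df1 so the planner can use the
    # in-memory statistics instead of the default estimate.
    df1.cache()
    df1.count()

    # Workaround 2 (Spark 1.5+): explicitly mark the genuinely small side
    # to be broadcast, via the DataFrame API instead of raw SQL.
    cond = (df1["k1"] == df2["k1"]) | (df1["k2"] == df2["k2"])  # hypothetical
    df = broadcast(df1).join(df2, cond, "outer")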
Hope this is helpful.

Cheers
Gen

On Mon, Aug 10, 2015 at 11:29 PM, gen tang <gen.tan...@gmail.com> wrote:
> Hi,
>
> I am sorry to bother you again.
> When I do a join as follows:
> df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b
> on condition1 *or* condition2")
> df.first()
>
> The program failed because the result size is bigger than
> spark.driver.maxResultSize.
> It is really strange, as a single record is in no way bigger than 1 GB.
> When I do the join on just one condition, or on an equality condition,
> there is no problem.
>
> Could anyone help me, please?
>
> Thanks a lot in advance.
>
> Cheers
> Gen
>
>
> On Sun, Aug 9, 2015 at 9:08 PM, gen tang <gen.tan...@gmail.com> wrote:
>
>> Hi,
>>
>> I might have a stupid question about Spark SQL's implementation of joins
>> on non-equality conditions, for instance condition1 or condition2.
>>
>> In fact, Hive doesn't support such joins, as it is very difficult to
>> express such conditions as a map/reduce job. However, Spark SQL supports
>> this operation, so I would like to know how Spark implements it.
>>
>> As I observe that such a join runs very slowly, I guess that Spark
>> implements it as a filter on top of a cartesian product. Is that true?
>>
>> Thanks in advance for your help.
>>
>> Cheers
>> Gen
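P.S. To close the loop on the original question quoted above: yes, as far as I can tell, a join on an OR condition is executed as a nested-loop join, i.e. conceptually a filter over the cartesian product. A quick way to confirm which plan Spark picks is to look at the physical plan; a sketch, with table and column names hypothetical:

    # A join with no equi-join keys: Spark falls back to a nested-loop /
    # cartesian strategy and applies the OR condition as a filter.
    a = sqlContext.table("a")
    b = sqlContext.table("b")
    cond = (a["x"] == b["x"]) | (a["y"] == b["y"])  # hypothetical columns
    df = a.join(b, cond, "outer")
    df.explain()  # the plan should show BroadcastNestedLoopJoin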