Re: Questions about SparkSQL join on not equality conditions
Hi, After taking a look at the code, I found out the problem: As spark will use broadcastNestedLoopJoin to treat nonequality condition. And one of my dataframe(df1) is created from an existing RDD(logicalRDD), so it uses defaultSizeInBytes * length to estimate the size. The other dataframe(df2) that I use is created from hive table(about 1G). Therefore spark think df1 is larger than df2, although df1 is very small. As a result, spark try to do df2.collect(), which causes the error. Hope this could be helpful Cheers Gen On Mon, Aug 10, 2015 at 11:29 PM, gen tang gen.tan...@gmail.com wrote: Hi, I am sorry to bother again. When I do join as follow: df = sqlContext.sql(selet a.someItem, b.someItem from a full outer join b on condition1 *or* condition2) df.first() The program failed at the result size is bigger than spark.driver.maxResultSize. It is really strange, as one record is no way bigger than 1G. When I do join on just one condition or equity condition, there will be no problem. Could anyone help me, please? Thanks a lot in advance. Cheers Gen On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote: Hi, I might have a stupid question about sparksql's implementation of join on not equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such join, as it is very difficult to express such conditions as a map/reduce job. However, sparksql supports such operation. So I would like to know how spark implement it. As I observe such join runs very slow, I guess that spark implement it by doing filter on the top of cartesian product. Is it true? Thanks in advance for your help. Cheers Gen
Re: Questions about SparkSQL join on not equality conditions
Hi, I am sorry to bother again. When I do join as follow: df = sqlContext.sql(selet a.someItem, b.someItem from a full outer join b on condition1 *or* condition2) df.first() The program failed at the result size is bigger than spark.driver.maxResultSize. It is really strange, as one record is no way bigger than 1G. When I do join on just one condition or equity condition, there will be no problem. Could anyone help me, please? Thanks a lot in advance. Cheers Gen On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote: Hi, I might have a stupid question about sparksql's implementation of join on not equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such join, as it is very difficult to express such conditions as a map/reduce job. However, sparksql supports such operation. So I would like to know how spark implement it. As I observe such join runs very slow, I guess that spark implement it by doing filter on the top of cartesian product. Is it true? Thanks in advance for your help. Cheers Gen
Questions about SparkSQL join on not equality conditions
Hi, I might have a stupid question about sparksql's implementation of join on not equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such join, as it is very difficult to express such conditions as a map/reduce job. However, sparksql supports such operation. So I would like to know how spark implement it. As I observe such join runs very slow, I guess that spark implement it by doing filter on the top of cartesian product. Is it true? Thanks in advance for your help. Cheers Gen