Hi,

After taking a look at the code, I found the problem: Spark uses BroadcastNestedLoopJoin to handle non-equality join conditions. One of my DataFrames (df1) is created from an existing RDD (a LogicalRDD), so Spark estimates its size as defaultSizeInBytes * length. The other DataFrame (df2) is created from a Hive table (about 1 GB). As a result, Spark thinks df1 is larger than df2, even though df1 is actually very small. Since BroadcastNestedLoopJoin broadcasts the side it believes to be smaller, Spark tries to collect df2 (~1 GB) to the driver, which exceeds spark.driver.maxResultSize and causes the error.
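For anyone who hits the same thing, here is a minimal sketch of the workarounds I can think of. The rdd, schema, table and column names below are made up for illustration, the broadcast hint needs Spark 1.5+, and I believe (but haven't verified on every version) that caching lets the planner pick up accurate in-memory statistics:

    from pyspark.sql.functions import broadcast  # broadcast hint, Spark 1.5+

    # df1 is backed by an existing RDD, so the planner estimates its size
    # as defaultSizeInBytes * length, which makes it look huge.
    df1 = sqlContext.createDataFrame(rdd, schema)
    df2 = sqlContext.table("some_hive_table")  # ~1 GB, real statistics

    # Workaround 1: cache and materialize df1 so the planner can use the
    # in-memory statistics instead of the default estimate.
    df1.cache()
    df1.count()

    # Workaround 2 (Spark 1.5+): explicitly mark the genuinely small side
    # to be broadcast, via the DataFrame API instead of raw SQL.
    cond = (df1["k1"] == df2["k1"]) | (df1["k2"] == df2["k2"])  # hypothetical
    df = broadcast(df1).join(df2, cond, "outer")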
Hope this is helpful.

Cheers
Gen

On Mon, Aug 10, 2015 at 11:29 PM, gen tang <gen.tan...@gmail.com> wrote:
> Hi,
>
> I am sorry to bother you again.
> When I do a join as follows:
> df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b
> on condition1 *or* condition2")
> df.first()
>
> The program failed because the result size is bigger than
> spark.driver.maxResultSize.
> It is really strange, as a single record is in no way bigger than 1 GB.
> When I do the join on just one condition, or on an equality condition,
> there is no problem.
>
> Could anyone help me, please?
>
> Thanks a lot in advance.
>
> Cheers
> Gen
>
>
> On Sun, Aug 9, 2015 at 9:08 PM, gen tang <gen.tan...@gmail.com> wrote:
>
>> Hi,
>>
>> I might have a stupid question about Spark SQL's implementation of joins
>> on non-equality conditions, for instance condition1 or condition2.
>>
>> In fact, Hive doesn't support such joins, as it is very difficult to
>> express such conditions as a map/reduce job. However, Spark SQL supports
>> this operation, so I would like to know how Spark implements it.
>>
>> As I observe that such a join runs very slowly, I guess that Spark
>> implements it as a filter on top of a cartesian product. Is that true?
>>
>> Thanks in advance for your help.
>>
>> Cheers
>> Gen
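P.S. To close the loop on the original question quoted above: yes, as far as I can tell, a join on an OR condition is executed as a nested-loop join, i.e. conceptually a filter over the cartesian product. A quick way to confirm which plan Spark picks is to look at the physical plan; a sketch, with table and column names hypothetical:

    # A join with no equi-join keys: Spark falls back to a nested-loop /
    # cartesian strategy and applies the OR condition as a filter.
    a = sqlContext.table("a")
    b = sqlContext.table("b")
    cond = (a["x"] == b["x"]) | (a["y"] == b["y"])  # hypothetical columns
    df = a.join(b, cond, "outer")
    df.explain()  # the plan should show BroadcastNestedLoopJoin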