Re: Questions about SparkSQL join on non-equality conditions

2015-08-11 Thread gen tang
Hi,

After taking a look at the code, I found the problem:
Spark uses BroadcastNestedLoopJoin to handle non-equality join conditions,
and one of my DataFrames (df1) is created from an existing RDD (a LogicalRDD),
so Spark estimates its size as defaultSizeInBytes * length. The other
DataFrame (df2) is created from a Hive table (about 1 GB). Therefore
Spark thinks df1 is larger than df2, although df1 is in fact very small. As a
result, Spark tries to broadcast df2 by doing df2.collect() on the driver,
which causes the error.
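
For what it's worth, here is a minimal sketch of how the planner's choice can
be inspected (the table and column names are made up, and it assumes a
Hive-enabled Spark 1.x context); the extended plan prints the
BroadcastNestedLoopJoin node and which side is marked for broadcast:

from pyspark import SparkContext
from pyspark.sql import HiveContext, Row

sc = SparkContext(appName="non-equi-join-plan")
sqlContext = HiveContext(sc)

# df1: built from an existing RDD, so Spark can only guess at its size.
df1 = sqlContext.createDataFrame(
    sc.parallelize([Row(someItem=1, x=1, y=2), Row(someItem=2, x=3, y=4)]))

# df2: backed by a Hive table (~1 GB here, hypothetical name), whose size
# Spark reads from the table metadata.
df2 = sqlContext.table("big_hive_table")

df1.registerTempTable("a")
df2.registerTempTable("b")

joined = sqlContext.sql(
    "select a.someItem, b.someItem from a full outer join b "
    "on a.x = b.x or a.y = b.y")

# The extended physical plan shows the BroadcastNestedLoopJoin node and the
# side the planner marked for broadcast (the side it believes is smaller).
joined.explain(True)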

Hope this is helpful.

Cheers
Gen


On Mon, Aug 10, 2015 at 11:29 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I am sorry to bother again.
 When I do the join as follows:
 df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b "
                     "on condition1 OR condition2")
 df.first()

 The program fails because the result size is bigger than
 spark.driver.maxResultSize.
 It is really strange, as a single record is in no way bigger than 1 GB.
 When I join on just one condition, or on an equality condition, there is no
 problem.
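
 (For reference, spark.driver.maxResultSize defaults to 1g. A minimal sketch
 of raising it when building the context is below, with an example value only;
 this only lifts the driver-side cap and does not change the join plan.)

 from pyspark import SparkConf, SparkContext
 from pyspark.sql import SQLContext

 # Raise the cap on the total size of results collected to the driver
 # (default is 1g).
 conf = SparkConf().set("spark.driver.maxResultSize", "4g")
 sc = SparkContext(conf=conf)
 sqlContext = SQLContext(sc)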

 Could anyone help me, please?

 Thanks a lot in advance.

 Cheers
 Gen


 On Sun, Aug 9, 2015 at 9:08 PM, gen tang gen.tan...@gmail.com wrote:

 Hi,

 I might have a stupid question about SparkSQL's implementation of joins on
 non-equality conditions, for instance condition1 OR condition2.

 In fact, Hive doesn't support such joins, as it is very difficult to
 express such conditions as a map/reduce job. However, SparkSQL supports
 this operation, so I would like to know how Spark implements it.

 As I observe that such a join runs very slowly, I guess that Spark
 implements it by applying a filter on top of a Cartesian product. Is that true?
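
 To make the guess concrete, here is a rough sketch of "filter on top of a
 Cartesian product" in plain RDD terms (purely hypothetical data; it assumes
 an existing SparkContext sc):

 from pyspark.sql import Row

 # Purely illustrative rows with fields x and y on each side.
 rdd_a = sc.parallelize([Row(x=1, y=2), Row(x=3, y=4)])
 rdd_b = sc.parallelize([Row(x=1, y=9), Row(x=7, y=4)])

 # Every (a, b) combination, then keep only the pairs that satisfy
 # condition1 OR condition2.
 pairs = rdd_a.cartesian(rdd_b)
 joined = pairs.filter(lambda ab: ab[0].x == ab[1].x or ab[0].y == ab[1].y)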

 Thanks in advance for your help.

 Cheers
 Gen





