Re: Questions about SparkSQL join on non-equality conditions

2015-08-11 Thread gen tang
Hi, After taking a look at the code, I found the problem: Spark uses broadcastNestedLoopJoin to handle non-equality conditions, and one of my DataFrames (df1) is created from an existing RDD (LogicalRDD), so Spark uses defaultSizeInBytes * length to estimate its size. The other DataFrame (df2)
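The overestimate described above can be sketched numerically. The variable names below mirror the Spark configuration keys (`spark.sql.autoBroadcastJoinThreshold` defaults to 10 MB; in Spark 1.x the per-row default size was deliberately set just above that threshold), but the row count is hypothetical — the point is only that an RDD-backed DataFrame's estimated size (defaultSizeInBytes * length) dwarfs the broadcast threshold even when the real data is small:

```python
# Hypothetical numbers illustrating the size overestimate for an
# RDD-backed DataFrame (LogicalRDD) with no real statistics.
auto_broadcast_threshold = 10 * 1024 * 1024           # 10 MB, the documented default
default_size_in_bytes = auto_broadcast_threshold + 1  # pessimistic per-row default (Spark 1.x)

num_rows = 1000                                       # hypothetical row count
estimated_size = default_size_in_bytes * num_rows

# Even a tiny RDD-backed DataFrame looks enormous to the planner,
# so it is never chosen as the small (broadcast) side of the join:
looks_broadcastable = estimated_size <= auto_broadcast_threshold
```

Since the estimate exceeds the threshold by construction, the planner treats df1 as too large to broadcast regardless of its actual contents.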

Re: Questions about SparkSQL join on non-equality conditions

2015-08-10 Thread gen tang
Hi, I am sorry to bother again. When I do a join as follows: df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b on condition1 *or* condition2"); df.first() The program failed because the result size is bigger than spark.driver.maxResultSize. It is really strange, as one record is no
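If the driver-side cap really is the blocker, it can be raised. `spark.driver.maxResultSize` is a real Spark setting (default 1g; "0" disables the limit), though whether raising it is the right fix here depends on why the result is so large. A minimal configuration sketch, with an arbitrarily chosen size:

```python
from pyspark import SparkConf

# Raise the cap on the total size of serialized results returned to the
# driver. Default is "1g"; "0" means unlimited. "4g" here is arbitrary.
conf = SparkConf().set("spark.driver.maxResultSize", "4g")
```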

Questions about SparkSQL join on non-equality conditions

2015-08-09 Thread gen tang
Hi, I might have a stupid question about Spark SQL's implementation of joins on non-equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such joins, as it is very difficult to express such conditions as a map/reduce job. However, sparksql supports such
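The reason such predicates are hard to express as a map/reduce job is that an OR of arbitrary conditions gives no single join key to hash or sort on, so every pair of rows has to be checked — which is exactly what a nested-loop join does. A minimal pure-Python sketch (the row contents and the two conditions are hypothetical, only the pairwise evaluation pattern matters):

```python
# Nested-loop join on a non-equality predicate (condition1 OR condition2).
# Unlike an equi-join, there is no key to partition on: every row of the
# left side is compared against every row of the right side, O(n * m).

def nested_loop_join(left, right, predicate):
    """Return all (l, r) pairs for which predicate(l, r) holds."""
    return [(l, r) for l in left for r in right if predicate(l, r)]

a = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]
b = [{"id": 2, "y": 15}, {"id": 3, "y": 25}]

# condition1 OR condition2: equal ids, or left.x < right.y
result = nested_loop_join(
    a, b, lambda l, r: l["id"] == r["id"] or l["x"] < r["y"]
)
```

Spark's broadcastNestedLoopJoin follows the same pattern, with one side broadcast to every executor so the pairwise comparison can run in parallel.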