Mike,

It looks like you are right. The results seem to be fine; it looks like I messed up on the filtering clause.
sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >= '2016-01-03' OR d.date IS NULL)").count() res2: Long = 53042 Davies, Cesar, Gourav, Thanks for the support. KP On Tue, May 3, 2016 at 11:26 AM, Michael Segel <msegel_had...@hotmail.com> wrote: > Silly question? > > If you change the predicate to > ( s.date >= ‘2016-01-03’ OR s.date IS NULL ) > AND > (d.date >= ‘2016-01-03’ OR d.date IS NULL) > > What do you get? > > Sorry if the syntax isn’t 100% correct. The idea is to not drop null > values from the query. > I would imagine that this shouldn’t kill performance since its most likely > a post join filter on the result set? > (Or is that just a Hive thing?) > > -Mike > > > On May 3, 2016, at 12:42 PM, Davies Liu <dav...@databricks.com> wrote: > > > > Bingo, the two predicate s.date >= '2016-01-03' AND d.date >= > > '2016-01-03' is the root cause, > > which will filter out all the nulls from outer join, will have same > > result as inner join. > > > > In Spark 2.0, we turn these join into inner join actually. > > > > On Tue, May 3, 2016 at 9:50 AM, Cesar Flores <ces...@gmail.com> wrote: > >> Hi > >> > >> Have you tried the joins without the where clause? When you use them > you are > >> filtering all the rows with null columns in those fields. In other > words you > >> are doing a inner join in all your queries. > >> > >> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> > >> wrote: > >>> > >>> Hi Kevin, > >>> > >>> Having given it a first look I do think that you have hit something > here > >>> and this does not look quite fine. I have to work on the multiple AND > >>> conditions in ON and see whether that is causing any issues. 
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <kpe...@gmail.com> wrote:
> >>>>
> >>>> Davies,
> >>>>
> >>>> Here is the code that I am typing into the spark-shell along with the results (my question is at the bottom):
> >>>>
> >>>> val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/")
> >>>> val swig = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/swig_csv/")
> >>>>
> >>>> dps.count
> >>>> res0: Long = 42694
> >>>>
> >>>> swig.count
> >>>> res1: Long = 42034
> >>>>
> >>>> dps.registerTempTable("dps_pin_promo_lt")
> >>>> swig.registerTempTable("swig_pin_promo_lt")
> >>>>
> >>>> sqlContext.sql("select * from dps_pin_promo_lt where date > '2016-01-03'").count
> >>>> res4: Long = 42666
> >>>>
> >>>> sqlContext.sql("select * from swig_pin_promo_lt where date > '2016-01-03'").count
> >>>> res5: Long = 34131
> >>>>
> >>>> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt where date > '2016-01-03'").count
> >>>> res6: Long = 42533
> >>>>
> >>>> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt where date > '2016-01-03'").count
> >>>> res7: Long = 34131
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>> res9: Long = 23809
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>> res10: Long = 23809
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>> res11: Long = 23809
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>> res12: Long = 23809
> >>>>
> >>>> From my results above, we notice that the counts of distinct values based on the join criteria and filter criteria for each individual table are located at res6 and res7. My question is why the outer join is producing fewer rows than the smallest table; if there are no matches it should still bring in that row as part of the outer join.
> >>>> For the full and right outer joins I am expecting to see a minimum of res6 rows, but I get fewer; is there something specific that I am missing here? I am expecting that the full outer join would give me the union of the two table sets, so I am expecting at least 42533 rows, not 23809.
> >>>>
> >>>> Gourav,
> >>>>
> >>>> I just ran this result set on a new session with slightly newer data... still seeing those results.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> KP
> >>>>
> >>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <dav...@databricks.com> wrote:
> >>>>>
> >>>>> As @Gourav said, all the joins with different join types show the same results, which means that every row from the left matches at least one row from the right, and every row from the right matches at least one row from the left, even though the number of rows on the left does not equal that on the right.
> >>>>>
> >>>>> This is the correct result.
> >>>>>
> >>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
> >>>>>> Yong,
> >>>>>>
> >>>>>> Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary.
> >>>>>>
> >>>>>> From the above set of queries (the ones mentioned in the comments above), both the inner and outer joins are producing the same counts. They are basically pulling out selected columns, but there is no roll-up happening or anything else that would plausibly cause a difference besides the type of join. The tables are matched on three keys that are present in both tables (ad, account, and date); on top of this they are filtered by date being above 2016-01-03. Since all the joins are producing the same counts, the natural suspicion is that the tables are identical, but when I run the following two queries:
> >>>>>>
> >>>>>> scala> sqlContext.sql("select * from swig_pin_promo_lt where date >= '2016-01-03'").count
> >>>>>> res14: Long = 34158
> >>>>>>
> >>>>>> scala> sqlContext.sql("select * from dps_pin_promo_lt where date >= '2016-01-03'").count
> >>>>>> res15: Long = 42693
> >>>>>>
> >>>>>> The above two queries filter the data based on the date used by the joins (2016-01-03), and you can see that the row counts of the two tables are different, which is why I suspect something is wrong with the outer joins in Spark SQL. In this situation the right and full outer joins may produce the same results, but they should not be equal to the left join and definitely not to the inner join, unless I am missing something.
> >>>>>>
> >>>>>> Side note: in my hasty response above I posted the wrong count for dps.count; the real value is res16: Long = 42694.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> KP
> >>>>>>
> >>>>>> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com> wrote:
> >>>>>>>
> >>>>>>> We are still not sure what the problem is, if you cannot show us some example data.
> >>>>>>>
> >>>>>>> For dps with 42632 rows and swig with 42034 rows: dps full outer join with swig on 3 columns, with additional filters, gets the same result-set row count as dps left outer join with swig on 3 columns, with additional filters, which again gets the same result-set row count as dps right outer join with swig on 3 columns, with the same additional filters.
> >>>>>>>
> >>>>>>> Without knowing your data, I cannot see a reason why this has to be a bug in Spark.
> >>>>>>>
> >>>>>>> Am I misunderstanding your bug?
> >>>>>>>
> >>>>>>> Yong
> >>>>>>>
> >>>>>>> ________________________________
> >>>>>>> From: kpe...@gmail.com
> >>>>>>> Date: Mon, 2 May 2016 12:11:18 -0700
> >>>>>>> Subject: Re: Weird results with Spark SQL Outer joins
> >>>>>>> To: gourav.sengu...@gmail.com
> >>>>>>> CC: user@spark.apache.org
> >>>>>>>
> >>>>>>> Gourav,
> >>>>>>>
> >>>>>>> I wish that were the case, but I have done a select count on each of the two tables individually and they return different numbers of rows:
> >>>>>>>
> >>>>>>> dps.registerTempTable("dps_pin_promo_lt")
> >>>>>>> swig.registerTempTable("swig_pin_promo_lt")
> >>>>>>>
> >>>>>>> dps.count()
> >>>>>>> RESULT: 42632
> >>>>>>>
> >>>>>>> swig.count()
> >>>>>>> RESULT: 42034
> >>>>>>>
> >>>>>>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> This shows that both the tables have matching records and no mismatches. Therefore you obviously have the same results irrespective of whether you use a right or left join.
> >>>>>>>
> >>>>>>> I think that there is no problem here, unless I am missing something.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gourav
> >>>>>>>
> >>>>>>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Also, the inner query produced the same results:
> >>>>>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>>>>> RESULT: 23747
> >>
> >> --
> >> Cesar Flores
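
For reference, here is a minimal spark-shell sketch of the behavior discussed in this thread, using two small made-up tables. The data, the table names s_t and d_t, and the resulting counts are illustrative assumptions, not the proprietary dataset above. It shows that a plain >= filter on both sides of a FULL OUTER JOIN discards the rows whose unmatched side is NULL, collapsing the count to the inner-join count, while the null-safe predicates from the corrected query at the top keep those rows.

// Illustrative only: tiny tables with one matching key and one unmatched key on each side.
import sqlContext.implicits._

val s = Seq(
  ("2016-01-04", "acctA", "ad1", 10.0),  // matches d
  ("2016-01-05", "acctB", "ad2", 20.0)   // no match in d
).toDF("date", "account", "ad", "spend")

val d = Seq(
  ("2016-01-04", "acctA", "ad1", 11.0),  // matches s
  ("2016-01-06", "acctC", "ad3", 30.0)   // no match in s
).toDF("date", "account", "ad", "spend_in_dollar")

s.registerTempTable("s_t")
d.registerTempTable("d_t")

// Plain filters on both sides drop the NULL rows produced by the outer join,
// so only the matched row survives: the inner-join count (1 here).
sqlContext.sql("SELECT * FROM s_t s FULL OUTER JOIN d_t d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()

// Null-safe filters keep the unmatched rows, giving the full outer-join count (3 here).
sqlContext.sql("SELECT * FROM s_t s FULL OUTER JOIN d_t d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >= '2016-01-03' OR d.date IS NULL)").count()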