Mike,

It looks like you are right. The results seem to be fine; it looks like I messed up on the filtering clause.
sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >= '2016-01-03' OR d.date IS NULL)").count() res2: Long = 53042 Davies, Cesar, Gourav, Thanks for the support. KP On Tue, May 3, 2016 at 11:26 AM, Michael Segel <msegel_had...@hotmail.com> wrote: > Silly question? > > If you change the predicate to > ( s.date >= ‘2016-01-03’ OR s.date IS NULL ) > AND > (d.date >= ‘2016-01-03’ OR d.date IS NULL) > > What do you get? > > Sorry if the syntax isn’t 100% correct. The idea is to not drop null > values from the query. > I would imagine that this shouldn’t kill performance since its most likely > a post join filter on the result set? > (Or is that just a Hive thing?) > > -Mike > > > On May 3, 2016, at 12:42 PM, Davies Liu <dav...@databricks.com> wrote: > > > > Bingo, the two predicate s.date >= '2016-01-03' AND d.date >= > > '2016-01-03' is the root cause, > > which will filter out all the nulls from outer join, will have same > > result as inner join. > > > > In Spark 2.0, we turn these join into inner join actually. > > > > On Tue, May 3, 2016 at 9:50 AM, Cesar Flores <ces...@gmail.com> wrote: > >> Hi > >> > >> Have you tried the joins without the where clause? When you use them > you are > >> filtering all the rows with null columns in those fields. In other > words you > >> are doing a inner join in all your queries. > >> > >> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> > >> wrote: > >>> > >>> Hi Kevin, > >>> > >>> Having given it a first look I do think that you have hit something > here > >>> and this does not look quite fine. I have to work on the multiple AND > >>> conditions in ON and see whether that is causing any issues. 
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <kpe...@gmail.com> wrote:
> >>>>
> >>>> Davies,
> >>>>
> >>>> Here is the code that I am typing into the spark-shell along with the results (my question is at the bottom):
> >>>>
> >>>> val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/")
> >>>> val swig = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/swig_csv/")
> >>>>
> >>>> dps.count
> >>>> res0: Long = 42694
> >>>>
> >>>> swig.count
> >>>> res1: Long = 42034
> >>>>
> >>>> dps.registerTempTable("dps_pin_promo_lt")
> >>>> swig.registerTempTable("swig_pin_promo_lt")
> >>>>
> >>>> sqlContext.sql("select * from dps_pin_promo_lt where date > '2016-01-03'").count
> >>>> res4: Long = 42666
> >>>>
> >>>> sqlContext.sql("select * from swig_pin_promo_lt where date > '2016-01-03'").count
> >>>> res5: Long = 34131
> >>>>
> >>>> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt where date > '2016-01-03'").count
> >>>> res6: Long = 42533
> >>>>
> >>>> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt where date > '2016-01-03'").count
> >>>> res7: Long = 34131
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>> res9: Long = 23809
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>> res10: Long = 23809
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>> res11: Long = 23809
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>> res12: Long = 23809
> >>>>
> >>>> From my results above, we notice that the counts of distinct values based on the join criteria and filter criteria for each individual table are located at res6 and res7. My question is why the outer join is producing fewer rows than the smallest table; if there are no matches it should still bring in that row as part of the outer join.
> >>>> For the full and right outer joins I am expecting to see a minimum of res6 rows, but I get fewer; is there something specific that I am missing here? I am expecting that the full outer join would give me the union of the two table sets, so I am expecting at least 42533 rows, not 23809.
> >>>>
> >>>> Gourav,
> >>>>
> >>>> I just ran this result set on a new session with slightly newer data... still seeing those results.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> KP
> >>>>
> >>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <dav...@databricks.com> wrote:
> >>>>>
> >>>>> As @Gourav said, all the joins with different join types show the same results, which means that every row from the left matches at least one row from the right, and every row from the right matches at least one row from the left, even though the number of rows on the left does not equal that on the right.
> >>>>>
> >>>>> This is the correct result.
> >>>>>
> >>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kpe...@gmail.com> wrote:
> >>>>>> Yong,
> >>>>>>
> >>>>>> Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary.
> >>>>>>
> >>>>>> From the above set of queries (the ones mentioned in the comments above), both the inner and outer joins are producing the same counts. They are basically pulling out selected columns, but there is no roll-up happening or anything else that would plausibly cause a difference besides the type of join. The tables are matched on three keys that are present in both tables (ad, account, and date); on top of this they are filtered by date being above 2016-01-03. Since all the joins are producing the same counts, the natural suspicion is that the tables are identical, but when I run the following two queries:
> >>>>>>
> >>>>>> scala> sqlContext.sql("select * from swig_pin_promo_lt where date >= '2016-01-03'").count
> >>>>>> res14: Long = 34158
> >>>>>>
> >>>>>> scala> sqlContext.sql("select * from dps_pin_promo_lt where date >= '2016-01-03'").count
> >>>>>> res15: Long = 42693
> >>>>>>
> >>>>>> The above two queries filter the data based on the date used by the joins (2016-01-03), and you can see that the row counts of the two tables are different, which is why I suspect something is wrong with the outer joins in Spark SQL. In this situation the right and full outer joins may produce the same results, but they should not be equal to the left join and definitely not to the inner join, unless I am missing something.
> >>>>>>
> >>>>>> Side note: in my hasty response above I posted the wrong count for dps.count; the real value is res16: Long = 42694.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> KP
> >>>>>>
> >>>>>> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <java8...@hotmail.com> wrote:
> >>>>>>>
> >>>>>>> We are still not sure what the problem is, if you cannot show us some example data.
> >>>>>>>
> >>>>>>> For dps with 42632 rows and swig with 42034 rows: dps full outer join with swig on 3 columns, with additional filters, gets the same result-set row count as dps left outer join with swig on 3 columns, with additional filters, which again gets the same result-set row count as dps right outer join with swig on 3 columns, with the same additional filters.
> >>>>>>>
> >>>>>>> Without knowing your data, I cannot see a reason why this has to be a bug in Spark.
> >>>>>>>
> >>>>>>> Am I misunderstanding your bug?
> >>>>>>>
> >>>>>>> Yong
> >>>>>>>
> >>>>>>> ________________________________
> >>>>>>> From: kpe...@gmail.com
> >>>>>>> Date: Mon, 2 May 2016 12:11:18 -0700
> >>>>>>> Subject: Re: Weird results with Spark SQL Outer joins
> >>>>>>> To: gourav.sengu...@gmail.com
> >>>>>>> CC: user@spark.apache.org
> >>>>>>>
> >>>>>>> Gourav,
> >>>>>>>
> >>>>>>> I wish that were the case, but I have done a select count on each of the two tables individually and they return different numbers of rows:
> >>>>>>>
> >>>>>>> dps.registerTempTable("dps_pin_promo_lt")
> >>>>>>> swig.registerTempTable("swig_pin_promo_lt")
> >>>>>>>
> >>>>>>> dps.count()
> >>>>>>> RESULT: 42632
> >>>>>>>
> >>>>>>> swig.count()
> >>>>>>> RESULT: 42034
> >>>>>>>
> >>>>>>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> This shows that both the tables have matching records and no mismatches. Therefore you obviously have the same results irrespective of whether you use a right or left join.
> >>>>>>>
> >>>>>>> I think that there is no problem here, unless I am missing something.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gourav
> >>>>>>>
> >>>>>>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Also, the inner query produced the same results:
> >>>>>>> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> >>>>>>> RESULT: 23747
> >>
> >> --
> >> Cesar Flores
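
For reference, here is a minimal spark-shell sketch of the behavior discussed in this thread, using two small made-up tables. The data, the table names s_t and d_t, and the resulting counts are illustrative assumptions, not the proprietary dataset above. It shows that a plain >= filter on both sides of a FULL OUTER JOIN discards the rows whose unmatched side is NULL, collapsing the count to the inner-join count, while the null-safe predicates from the corrected query at the top keep those rows.

// Illustrative only: tiny tables with one matching key and one unmatched key on each side.
import sqlContext.implicits._

val s = Seq(
  ("2016-01-04", "acctA", "ad1", 10.0),  // matches d
  ("2016-01-05", "acctB", "ad2", 20.0)   // no match in d
).toDF("date", "account", "ad", "spend")

val d = Seq(
  ("2016-01-04", "acctA", "ad1", 11.0),  // matches s
  ("2016-01-06", "acctC", "ad3", 30.0)   // no match in s
).toDF("date", "account", "ad", "spend_in_dollar")

s.registerTempTable("s_t")
d.registerTempTable("d_t")

// Plain filters on both sides drop the NULL rows produced by the outer join,
// so only the matched row survives: the inner-join count (1 here).
sqlContext.sql("SELECT * FROM s_t s FULL OUTER JOIN d_t d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()

// Null-safe filters keep the unmatched rows, giving the full outer-join count (3 here).
sqlContext.sql("SELECT * FROM s_t s FULL OUTER JOIN d_t d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >= '2016-01-03' OR d.date IS NULL)").count()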