-dev +user

I'd suggest running .explain() on both DataFrames to understand the performance better. The problem is likely that the optimizer has a pattern that looks for equality predicates where each side can be evaluated using only one side of the join; we turn those into a hash join.
(df("eday") - laggard("p_eday")) === 1 is pretty tricky for us to understand, so the pattern misses the possible optimized plan.

On Wed, Aug 26, 2015 at 6:10 PM, David Smith <das...@gmail.com> wrote:

> I've noticed that two queries, which return identical results, have very
> different performance. I'd be interested in any hints about how to avoid
> problems like this.
>
> The DataFrame df contains a string field "series" and an integer "eday",
> the number of days since (or before) the 1970-01-01 epoch.
>
> I'm doing some analysis over a sliding date window and, for now, avoiding
> UDAFs. I'm therefore using a self join. First, I create
>
>     val laggard = df.withColumnRenamed("series", "p_series")
>                     .withColumnRenamed("eday", "p_eday")
>
> Then, the following query runs in 16s:
>
>     df.join(laggard, (df("series") === laggard("p_series")) &&
>       (df("eday") === (laggard("p_eday") + 1))).count
>
> while the following query runs in 4-6 minutes:
>
>     df.join(laggard, (df("series") === laggard("p_series")) &&
>       ((df("eday") - laggard("p_eday")) === 1)).count
>
> It's worth noting that the series term is necessary to keep the query from
> doing a complete Cartesian product over the data.
>
> Ideally, I'd like to look at lags of more than one day, but the following
> is equally slow:
>
>     df.join(laggard, (df("series") === laggard("p_series")) &&
>       (df("eday") - laggard("p_eday")).between(1, 7)).count
>
> Any advice about the general principle at work here would be welcome.
>
> Thanks,
> David
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Differing-performance-in-self-joins-tp13864.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
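To illustrate the principle outside Spark, here is a minimal sketch in plain Scala (the data and names are hypothetical, not from the thread). When the predicate can be split into a key computable from each side alone, as in eday === p_eday + 1, the join can build a hash table on one side and probe it with the other, instead of comparing every pair of rows as a non-equi predicate forces:

```scala
// Hypothetical data: (series, eday) rows, mirroring the schema in the question.
case class Row(series: String, eday: Int)

val rows = Seq(
  Row("a", 100), Row("a", 101), Row("a", 103),
  Row("b", 100), Row("b", 101)
)

// Equi-join form: each side has a key computable from that side alone,
//   build-side key: (series, eday)    probe-side key: (series, eday - 1)
// so one pass can hash one side and a second pass can probe it.
val byKey: Map[(String, Int), Seq[Row]] =
  rows.groupBy(r => (r.series, r.eday))

// Probe: for each row, find the row exactly one day earlier in the same series.
val lag1 = rows.flatMap { r =>
  byKey.getOrElse((r.series, r.eday - 1), Nil).map(p => (p, r))
}
println(lag1.size) // 2: (a,100)->(a,101) and (b,100)->(b,101)

// A lag of 1..7 days can stay an equi-join by probing one key per offset,
// rather than using the non-equi predicate (eday - p_eday).between(1, 7).
val lag1to7 = rows.flatMap { r =>
  (1 to 7).flatMap { d =>
    byKey.getOrElse((r.series, r.eday - d), Nil).map(p => (p, r))
  }
}
println(lag1to7.size) // 4: adds (a,100)->(a,103) and (a,101)->(a,103)
```

The same idea applies in Spark itself: expressing each lag as its own equality (eday === p_eday + d), or joining against an explicit offsets table on an equality, keeps the predicate in the shape the optimizer's pattern recognizes; .explain() should then show a hash-based join operator rather than a nested-loop or Cartesian plan.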