I think this is not a problem in 3.0 anymore, see https://issues.apache.org/jira/browse/SPARK-27638
On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <russell.spit...@gmail.com> wrote: > I've just run into this issue again with another user and I feel like most > folks here have seen some flavor of this at some point. > > The user registers a Datasource with a column of type Date (or some non > string) then performs a query that looks like. > > *SELECT * from Source WHERE date_col > '2020-08-03'* > > Seeing that the predicate literal here is a String, Spark needs to make a > change so that the DataSource column will be of the same type (Date), > so it places a "Cast" on the Datasource column so our plan ends up looking > like. > > Cast(date_col as String) > '2020-08-03' > > Since the Datasource Strategies can't handle a push down of the "Cast" > function we lose the predicate pushdown we could > have had. This can change a Job from a single partition lookup into a full > scan leading to a very confusing situation for > the end user. I also wonder about the relative cost here since we could be > avoiding doing X casts and instead just do a single > one on the predicate, in addition we could be doing the cast at the > Analysis phase and cut the run short before any work even > starts rather than doing a perhaps meaningless comparison between a date > and a non-date string. > > I think we should seriously consider whether in cases like this we should > attempt to cast the literal rather than casting the > source column. > > Please let me know if anyone has thoughts on this, or has some previous > Jiras I could dig into if it's been discussed before, > Russ >