Thanks Bart. I'll give it a try. Presto has done something very similar here (thanks DB for finding this!). They published an article ([1]) last year with a very thorough analysis of all the cases, which I think can be used as a reference for the implementation in Spark.
[1]: https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html

On Wed, Aug 26, 2020 at 1:37 AM Bart Samwel <bart.sam...@databricks.com> wrote:

> IMO it's worth an attempt. The previous attempts seem to have been closed
> because of a general sense that this gets messy and leads to lots of
> special cases, but that's just how it is. This optimization would make the
> difference between getting sub-par performance when using some of these
> data types and getting decent performance. Also, even if the predicate
> doesn't get pushed down, the transformation can make execution of the
> predicate faster. So this can be an early optimization rule, not tied to
> pushdowns specifically.
>
> I agree that it gets tricky for some data types, so I'd suggest starting
> small and doing this only for integers, then covering decimals. For those
> data types at least you can easily reason that the conversion is correct.
> Other data types are a lot trickier, and we should analyze them one by one.
>
> On Tue, Aug 25, 2020 at 7:31 PM Chao Sun <sunc...@apache.org> wrote:
>
>> Hi,
>>
>> I just realized there were already multiple attempts on this issue in
>> the past. From the discussion, it seems the preferred approach is to
>> eliminate the cast before it gets pushed to data sources, at least for a
>> few common cases such as numeric types. However, a few PRs following this
>> direction were rejected (see [1] and [2]), so I'm wondering if this is
>> still worth trying, or if the community thinks it is risky and better
>> left untouched.
>>
>> On the other hand, perhaps we could do the minimum and emit some sort of
>> warning to remind users that they need to add an explicit cast to enable
>> pushdown in this case. What do you think?
>>
>> Thanks for your input!
>> Chao
>>
>> [1]: https://github.com/apache/spark/pull/8718
>> [2]: https://github.com/apache/spark/pull/27648
>>
>> On Mon, Aug 24, 2020 at 1:57 PM Chao Sun <sunc...@apache.org> wrote:
>>
>>> > Currently we can't.
>>> > This is something we should improve, by either pushing down the cast
>>> > to the data source, or simplifying the predicates to eliminate the
>>> > cast.
>>>
>>> Hi all, I've created https://issues.apache.org/jira/browse/SPARK-32694
>>> to track this. Welcome to comment on the JIRA.
>>>
>>> On Wed, Aug 19, 2020 at 7:08 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Currently we can't. This is something we should improve, by either
>>>> pushing down the cast to the data source, or simplifying the predicates
>>>> to eliminate the cast.
>>>>
>>>> On Wed, Aug 19, 2020 at 5:09 PM Bart Samwel
>>>> <bart.sam...@databricks.com> wrote:
>>>>
>>>>> And how are we doing here on integer pushdowns? If someone does e.g.
>>>>> CAST(short_col AS LONG) < 1000, can we still push down
>>>>> "short_col < 1000" without the cast?
>>>>>
>>>>> On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer
>>>>> <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Thanks! That's exactly what I was hoping for! Thanks for finding the
>>>>>> Jira for me!
>>>>>>
>>>>>> On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan <cloud0...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think this is not a problem in 3.0 anymore; see
>>>>>>> https://issues.apache.org/jira/browse/SPARK-27638
>>>>>>>
>>>>>>> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer
>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I've just run into this issue again with another user, and I feel
>>>>>>>> like most folks here have seen some flavor of this at some point.
>>>>>>>>
>>>>>>>> The user registers a data source with a column of type Date (or
>>>>>>>> some other non-string type), then performs a query that looks like:
>>>>>>>>
>>>>>>>> SELECT * FROM Source WHERE date_col > '2020-08-03'
>>>>>>>>
>>>>>>>> Seeing that the predicate literal here is a String, Spark needs to
>>>>>>>> make the two sides of the comparison the same type, so it places a
>>>>>>>> Cast on the data source column, and our plan ends up looking like:
>>>>>>>>
>>>>>>>> Cast(date_col AS String) > '2020-08-03'
>>>>>>>>
>>>>>>>> Since the data source strategies can't handle a pushdown of the
>>>>>>>> Cast function, we lose the predicate pushdown we could have had.
>>>>>>>> This can turn a job from a single-partition lookup into a full
>>>>>>>> scan, which is a very confusing situation for the end user. I also
>>>>>>>> wonder about the relative cost here, since we could avoid doing one
>>>>>>>> cast per row and instead do a single cast on the predicate literal.
>>>>>>>> In addition, we could do that cast at the analysis phase and cut
>>>>>>>> the run short before any work even starts, rather than doing a
>>>>>>>> perhaps meaningless comparison between a date and a non-date
>>>>>>>> string.
>>>>>>>>
>>>>>>>> I think we should seriously consider whether in cases like this we
>>>>>>>> should attempt to cast the literal rather than casting the source
>>>>>>>> column.
>>>>>>>>
>>>>>>>> Please let me know if anyone has thoughts on this, or has some
>>>>>>>> previous Jiras I could dig into if it's been discussed before.
>>>>>>>> Russ
>>>>>>>>
>>>>>
>>>>> --
>>>>> Bart Samwel
>>>>> bart.sam...@databricks.com
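[Editor's note] Russell's proposal — cast the literal once at analysis time instead of casting the column on every row — can be sketched outside Spark as a tiny expression rewrite. This is only an illustrative model, not Spark's Catalyst API; every class and function name below is invented. It assumes the literal is an ISO-8601 date string, for which string order and date order agree, so the rewrite preserves the comparison's semantics.

```python
import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class Col:
    """A DATE column reference."""
    name: str

@dataclass(frozen=True)
class CastToString:
    """The cast the analyzer puts on the column side of the comparison."""
    child: Col

@dataclass(frozen=True)
class Gt:
    """The comparison left > right."""
    left: object
    right: object

def cast_literal_instead(pred):
    """Rewrite Cast(date_col AS String) > 'yyyy-mm-dd' into
    date_col > DATE 'yyyy-mm-dd'.

    One cast on the literal replaces one cast per row on the column, the
    bare-column predicate stays pushable, and a malformed literal fails
    here, at "analysis" time, instead of feeding a meaningless string
    comparison to the scan.
    """
    if (isinstance(pred, Gt) and isinstance(pred.left, CastToString)
            and isinstance(pred.right, str)):
        # Raises ValueError for a non-date string, failing fast as proposed.
        lit = datetime.date.fromisoformat(pred.right)
        return Gt(pred.left.child, lit)
    return pred
```

With this shape, the plan from the example becomes `date_col > DATE '2020-08-03'`, which a source that understands date predicates can turn back into a single-partition lookup.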
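[Editor's note] Bart's integer starting point (CAST(short_col AS LONG) < 1000) can be modeled the same way. The sketch below — again with invented names, not Catalyst internals — shows the range reasoning that makes the integer case easy: a widening cast is strictly monotonic, so the literal can be narrowed whenever it fits the source type's range, and the predicate folds to a constant otherwise.

```python
from dataclasses import dataclass

SHORT_MIN, SHORT_MAX = -32768, 32767  # SMALLINT range

@dataclass(frozen=True)
class Col:
    """A SMALLINT column reference."""
    name: str

@dataclass(frozen=True)
class Cast:
    """The widening cast SMALLINT -> BIGINT inserted by the analyzer."""
    child: Col

@dataclass(frozen=True)
class Lt:
    """The comparison left < right."""
    left: object
    right: object

def rewrite(pred):
    """Turn CAST(col AS LONG) < lit into a predicate on the bare column.

    Dropping a strictly monotonic widening cast cannot change the result of
    the comparison; a literal outside the SMALLINT range folds the whole
    predicate to a constant True or False.
    """
    if (isinstance(pred, Lt) and isinstance(pred.left, Cast)
            and isinstance(pred.right, int)):
        v = pred.right
        if v > SHORT_MAX:
            return True     # every SMALLINT value is < v
        if v <= SHORT_MIN:
            return False    # no SMALLINT value is < v
        return Lt(pred.left.child, v)  # literal fits: drop the cast
    return pred
```

Here `CAST(short_col AS LONG) < 1000` becomes `short_col < 1000`, which can be pushed down directly, matching Bart's example; the same range argument extends to decimals, while other types need the case-by-case analysis discussed above.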