Thanks Bart. I'll give it a try. Presto has done something very similar
(thanks DB for finding this!): they published an article ([1]) last year
with a very thorough analysis of all the cases, which I think can be used
as a reference for the implementation in Spark.

[1]: https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html
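
As a rough sketch of the kind of rule this could become in Catalyst (purely
illustrative: the rule name, the single pattern handled, and the guard are
my assumptions, not a proposed final design):

    // Hypothetical rule: unwrap a widening integral cast in a comparison,
    // e.g. CAST(short_col AS BIGINT) > 1000L  =>  short_col > 1000S.
    // Safe because widening integral casts are lossless and monotonic.
    import org.apache.spark.sql.catalyst.expressions._
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.types._

    object UnwrapCastInComparison extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan =
        plan transformAllExpressions {
          case GreaterThan(Cast(col: AttributeReference, LongType, _),
                           Literal(v: Long, LongType))
              // Only rewrite when the literal fits in the narrower type;
              // out-of-range literals need the constant-folding treatment
              // described in the Presto article.
              if col.dataType == ShortType && v.isValidShort =>
            GreaterThan(col, Literal(v.toShort, ShortType))
        }
    }

A real implementation would of course cover all comparison operators and
integral widths, plus the out-of-range and null cases the article walks
through.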

On Wed, Aug 26, 2020 at 1:37 AM Bart Samwel <bart.sam...@databricks.com>
wrote:

> IMO it's worth an attempt. The previous attempts seem to have been closed
> because of a general sense that this gets messy and leads to lots of special
> cases, but that's just how it is. This optimization would make the difference
> between sub-par and decent performance when using some of these data types.
> Also, even if the predicate doesn't get pushed down, the transformation can
> make execution of the predicate faster. So this can be an early optimization
> rule, not tied to pushdowns specifically.
>
> I agree that it gets tricky for some data types, so I'd suggest starting
> small and doing this only for integers, then covering decimals. For those
> data types, at least, it's easy to reason that the conversion is correct.
> Other data types are a lot trickier and we should analyze them one by one.
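>
> To illustrate with a made-up example: CAST(short_col AS BIGINT) < 1000 can
> be rewritten to short_col < 1000S, because the widening cast is lossless
> and order-preserving. By contrast, CAST(short_col AS BIGINT) < 100000 holds
> for every non-null SHORT value, so it can be simplified without consulting
> the column at all (modulo null semantics). Decimals allow similar reasoning
> as long as the literal is exactly representable at the column's precision
> and scale.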
>
> On Tue, Aug 25, 2020 at 7:31 PM Chao Sun <sunc...@apache.org> wrote:
>
>> Hi,
>>
>> I just realized there have already been multiple attempts at this issue in
>> the past. From the discussion, it seems the preferred approach is to
>> eliminate the casts before they get pushed to data sources, at least for a
>> few common cases such as numeric types. However, a few PRs following this
>> direction were rejected (see [1] and [2]), so I'm wondering if this is
>> still something worth trying, or if the community considers it too risky to
>> touch.
>>
>> On the other hand, perhaps we can do the bare minimum and generate some
>> sort of warning to remind users that they need to add an explicit cast to
>> enable pushdown in this case. What do you think?
>>
>> Thanks for your input!
>> Chao
>>
>>
>> [1]: https://github.com/apache/spark/pull/8718
>> [2]: https://github.com/apache/spark/pull/27648
>>
>> On Mon, Aug 24, 2020 at 1:57 PM Chao Sun <sunc...@apache.org> wrote:
>>
>>> > Currently we can't. This is something we should improve, by either
>>> pushing down the cast to the data source, or simplifying the predicates
>>> to eliminate the cast.
>>>
>>> Hi all, I've created https://issues.apache.org/jira/browse/SPARK-32694 to
>>> track this. Comments on the JIRA are welcome.
>>>
>>> On Wed, Aug 19, 2020 at 7:08 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Currently we can't. This is something we should improve, by either
>>>> pushing down the cast to the data source, or simplifying the predicates to
>>>> eliminate the cast.
>>>>
>>>> On Wed, Aug 19, 2020 at 5:09 PM Bart Samwel <bart.sam...@databricks.com>
>>>> wrote:
>>>>
>>>>> And how are we doing here on integer pushdowns? If someone does e.g.
>>>>> CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000"
>>>>> without the cast?
>>>>>
>>>>> On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Thanks! That's exactly what I was hoping for! Thanks for finding the
>>>>>> Jira for me!
>>>>>>
>>>>>> On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan <cloud0...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think this is not a problem in 3.0 anymore, see
>>>>>>> https://issues.apache.org/jira/browse/SPARK-27638
>>>>>>>
>>>>>>> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I've just run into this issue again with another user and I feel
>>>>>>>> like most folks here have seen some flavor of this at some point.
>>>>>>>>
>>>>>>>> The user registers a DataSource with a column of type Date (or some
>>>>>>>> other non-string type), then performs a query that looks like:
>>>>>>>>
>>>>>>>> SELECT * FROM Source WHERE date_col > '2020-08-03'
>>>>>>>>
>>>>>>>> Seeing that the predicate literal here is a String, Spark needs to
>>>>>>>> make both sides of the comparison the same type, so it places a
>>>>>>>> "Cast" on the DataSource column, and our plan ends up looking like:
>>>>>>>>
>>>>>>>> Cast(date_col AS String) > '2020-08-03'
>>>>>>>>
>>>>>>>> Since the DataSource strategies can't handle pushing down the "Cast"
>>>>>>>> function, we lose the predicate pushdown we could have had. This can
>>>>>>>> turn a job from a single-partition lookup into a full scan, leading
>>>>>>>> to a very confusing situation for the end user. I also wonder about
>>>>>>>> the relative cost here: instead of casting the column value once per
>>>>>>>> row, we could do a single cast on the predicate literal. In addition,
>>>>>>>> we could do that cast at the analysis phase and cut the run short
>>>>>>>> before any work even starts, rather than doing a perhaps meaningless
>>>>>>>> comparison between a date and a non-date string.
>>>>>>>>
>>>>>>>> I think we should seriously consider whether in cases like this we
>>>>>>>> should attempt to cast the literal rather than casting the
>>>>>>>> source column.
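>>>>>>>> Concretely (the exact form is of course up for discussion), the plan
>>>>>>>> above would then look like
>>>>>>>>
>>>>>>>> date_col > Cast('2020-08-03' AS Date)
>>>>>>>>
>>>>>>>> leaving date_col bare so the existing pushdown machinery can still
>>>>>>>> match on it.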
>>>>>>>>
>>>>>>>> Please let me know if anyone has thoughts on this, or knows of
>>>>>>>> previous JIRAs I could dig into if it's been discussed before,
>>>>>>>> Russ
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Bart Samwel
>>>>> bart.sam...@databricks.com
>>>>>
>>>>>
>>>>>
>
> --
> Bart Samwel
> bart.sam...@databricks.com
>
>
>
