Hi,

I just realized there have already been multiple attempts at this issue in
the past. From the discussion, it seems the preferred approach is to
eliminate the casts before the predicates get pushed to data sources, at
least for a few common cases such as numeric types. However, a few PRs
following this direction were rejected (see [1] and [2]), so I'm wondering
if this is still something worth trying, or if the community thinks it is
too risky and better left untouched.
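
To make the numeric case concrete, here is a rough sketch in plain Python
of what "eliminating the cast" could mean for integer columns. This is not
Spark code: the function name, the type names and the return format are
all made up for illustration, and null handling is ignored.

```python
# Illustrative only: these names are hypothetical and are not Spark APIs.
INT_RANGES = {
    "byte": (-2**7, 2**7 - 1),
    "short": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
    "long": (-2**63, 2**63 - 1),
}
WIDENING_ORDER = ["byte", "short", "int", "long"]


def simplify_cast_comparison(col_type, cast_type, op, literal):
    """Rewrite CAST(col AS cast_type) <op> literal into a predicate on col.

    Returns ("compare", op, literal) when the cast can simply be dropped,
    ("always", bool) when the literal lies outside the column's range, or
    None when the rewrite is not known to be safe.
    """
    if col_type not in INT_RANGES or cast_type not in INT_RANGES:
        return None
    # Only a widening cast preserves every column value exactly.
    if WIDENING_ORDER.index(cast_type) < WIDENING_ORDER.index(col_type):
        return None
    if op not in ("<", "<=", ">", ">=", "="):
        return None
    lo, hi = INT_RANGES[col_type]
    if lo <= literal <= hi:
        # The literal is representable in the column type: drop the cast.
        return ("compare", op, literal)
    if op == "=":
        return ("always", False)
    if literal > hi:
        # Every column value is smaller than the literal.
        return ("always", op in ("<", "<="))
    # Every column value is larger than the literal.
    return ("always", op in (">", ">="))
```

With this, CAST(short_col AS LONG) < 1000 would become short_col < 1000,
which a data source could accept as a plain filter.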

On the other hand, perhaps we can do the minimum and generate some sort of
warning to remind users that they need to explicitly add a cast to enable
pushdown in this case. What do you think?
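
As a very rough illustration of the warning idea, again in plain Python
with hypothetical classes (Column, Cast, Comparison) standing in for the
real expression tree, the check could be as simple as noticing that the
column side of a comparison is wrapped in a cast:

```python
from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical stand-ins for expression nodes; not Spark classes.
@dataclass
class Column:
    name: str
    dtype: str


@dataclass
class Cast:
    child: Column
    to: str


@dataclass
class Comparison:
    left: Union[Column, Cast]
    op: str
    literal: object


def pushdown_warning(pred: Comparison) -> Optional[str]:
    """Return a warning if a cast on the column side blocks pushdown."""
    if isinstance(pred.left, Cast):
        col = pred.left.child
        return (
            f"Filter on '{col.name}' will not be pushed down: the column is "
            f"cast from {col.dtype} to {pred.left.to}. Consider casting the "
            f"literal to {col.dtype} explicitly instead."
        )
    # No cast on the column side, so nothing to warn about.
    return None
```

For the date example from earlier in the thread, this would flag
Cast(date_col as String) > '2020-08-03' and stay silent once the user
casts the literal instead.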

Thanks for your input!
Chao


[1]: https://github.com/apache/spark/pull/8718
[2]: https://github.com/apache/spark/pull/27648

On Mon, Aug 24, 2020 at 1:57 PM Chao Sun <sunc...@apache.org> wrote:

> > Currently we can't. This is something we should improve, by either
> pushing down the cast to the data source, or simplifying the predicates
> to eliminate the cast.
>
> Hi all, I've created https://issues.apache.org/jira/browse/SPARK-32694 to
> track this. Feel free to comment on the JIRA.
>
> On Wed, Aug 19, 2020 at 7:08 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Currently we can't. This is something we should improve, by either
>> pushing down the cast to the data source, or simplifying the predicates to
>> eliminate the cast.
>>
>> On Wed, Aug 19, 2020 at 5:09 PM Bart Samwel <bart.sam...@databricks.com>
>> wrote:
>>
>>> And how are we doing here on integer pushdowns? If someone does e.g.
>>> CAST(short_col AS LONG) < 1000, can we still push down "short_col < 1000"
>>> without the cast?
>>>
>>> On Tue, Aug 4, 2020 at 6:55 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> Thanks! That's exactly what I was hoping for! Thanks for finding the
>>>> Jira for me!
>>>>
>>>> On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think this is not a problem in 3.0 anymore, see
>>>>> https://issues.apache.org/jira/browse/SPARK-27638
>>>>>
>>>>> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I've just run into this issue again with another user and I feel like
>>>>>> most folks here have seen some flavor of this at some point.
>>>>>>
>>>>>> The user registers a Datasource with a column of type Date (or some
>>>>>> other non-string type) and then performs a query that looks like:
>>>>>>
>>>>>> SELECT * from Source WHERE date_col > '2020-08-03'
>>>>>>
>>>>>> Seeing that the predicate literal here is a String, Spark needs both
>>>>>> sides of the comparison to have the same type, so it places a "Cast"
>>>>>> on the Datasource column (of type Date) and our plan ends up looking
>>>>>> like:
>>>>>>
>>>>>> Cast(date_col as String) > '2020-08-03'
>>>>>>
>>>>>> Since the Datasource Strategies can't handle a push down of the
>>>>>> "Cast" function, we lose the predicate pushdown we could
>>>>>> have had. This can change a Job from a single partition lookup into a
>>>>>> full scan, leading to a very confusing situation for
>>>>>> the end user. I also wonder about the relative cost here, since we
>>>>>> could avoid doing X casts and instead just do a single
>>>>>> one on the predicate. In addition, we could do the cast at the
>>>>>> Analysis phase and cut the run short before any work even
>>>>>> starts, rather than doing a perhaps meaningless comparison between a
>>>>>> date and a non-date string.
>>>>>>
>>>>>> I think we should seriously consider whether in cases like this we
>>>>>> should attempt to cast the literal rather than casting the
>>>>>> source column.
>>>>>>
>>>>>> Please let me know if anyone has thoughts on this, or has some
>>>>>> previous Jiras I could dig into if it's been discussed before,
>>>>>> Russ
>>>>>>
>>>>>
>>>
>>> --
>>> Bart Samwel
>>> bart.sam...@databricks.com
>>>
>>>
>>>
