Expressions/functions can be expensive, and I do think Spark should trust the data source and not re-apply pushed filters. If the data source lies, many things can go wrong...
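For what it's worth, the DataSourceV2 reader API already encodes this split: pushFilters() returns the filters the source cannot handle, and only those need to be evaluated by Spark after the scan. A minimal sketch against the Spark 2.4 API (MyReader and canHandle are made-up names for illustration):

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

class MyReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // Spark calls this during planning. We keep the filters we can honor and
  // return the rest; Spark evaluates only the returned ones after the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition(canHandle)
    pushed = supported
    unsupported
  }

  // This is what explain() reports as PushedFilters.
  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType = schema

  // A real source would return actual partitions here.
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] =
    util.Collections.emptyList[InputPartition[InternalRow]]()

  // Source-specific capability check; made up for this sketch.
  private def canHandle(f: Filter): Boolean = false
}

So if a source keeps a filter in pushedFilters() but silently does not apply it, Spark has no way to tell; re-checking everything on the Spark side would negate much of the point of pushing down in the first place.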
On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke <jornfra...@gmail.com> wrote:

> Well, even if Spark has to apply the filter again, with pushdown activated
> it is much less costly for Spark to check whether the filter has been
> applied or not. Applying the filter itself is negligible; what pushdown
> really avoids, if the file format implements it, is the IO cost (for
> reading) as well as the cost of converting from the file format's internal
> datatypes to Spark's. Those two things are very expensive, but not the
> filter check. In the end, there can also be data-source-internal reasons
> not to apply a filter (there can be many, depending on your scenario, the
> format, etc.). Instead of "discussing" between Spark and the data source,
> it is much less costly for Spark to check that the filters are
> consistently applied.
>
> On 09.12.2018 at 12:39, Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
> Hello,
> that's an interesting question, but after Franke's reply I am a bit
> puzzled.
>
> If there is no control over the pushdown status, how can Spark guarantee
> the correctness of the final query?
>
> Consider a filter pushed down to the data source: either Spark has to know
> whether it has been applied or not, or it has to re-apply the filter
> anyway (and pay the price for that).
>
> Is there any other option I am not considering?
>
> Best regards,
> Alessandro
>
> On Sat, Dec 8, 2018 at 12:32, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily
>> ending up in Spark (because it would cause unnecessary overhead).
>> In the DataSource V2 API you need to implement SupportsPushDownFilters.
>>
>> On 08.12.2018 at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm a support engineer, interested in DataSourceV2.
>> >
>> > Recently I had some pain troubleshooting whether pushdown was actually
>> > applied or not.
>> > I noticed that DataFrame's explain() method shows pushdown even for
>> > JSON.
>> > It totally depends on the DataSource side, I believe. However, I would
>> > like Spark to have some way to confirm whether a specific pushdown is
>> > actually applied in the DataSource or not.
>> >
>> > # Example
>> > val df = spark.read.json("s3://sample_bucket/people.json")
>> > df.printSchema()
>> > df.filter($"age" > 20).explain()
>> >
>> > root
>> >  |-- age: long (nullable = true)
>> >  |-- name: string (nullable = true)
>> >
>> > == Physical Plan ==
>> > *Project [age#47L, name#48]
>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>> >    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
>> > Location: InMemoryFileIndex[s3://sample_bucket/people.json],
>> > PartitionFilters: [], PushedFilters: [IsNotNull(age),
>> > GreaterThan(age,20)], ReadSchema: struct<age:bigint,name:string>
>> >
>> > # Comments
>> > As you can see, PushedFilters are shown even though the input data is
>> > JSON. In reality this pushdown is not used.
>> >
>> > I'm wondering whether this has already been discussed.
>> > If not, this is a chance to add such a feature to DataSourceV2, because
>> > it would require some API-level changes.
>> >
>> > Warm regards,
>> >
>> > Noritaka Sekiyama
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org