It is not a question of lying or of trust. Some or all filters may not be 
supported by a data source, and some might only be applied under certain 
runtime conditions (e.g. enough memory available).

It is much more expensive for Spark and a data source to communicate which 
filters have actually been applied than for Spark simply to check them itself, 
as it does today. This is especially true when several different data sources 
are involved at the same time (joins etc.).

> On 09.12.2018, at 14:30, Wenchen Fan <cloud0...@gmail.com> wrote:
> 
> Expressions/functions can be expensive, and I do think Spark should trust the 
> data source and not re-apply pushed filters. If the data source lies, many 
> things can go wrong...
> 
>> On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke <jornfra...@gmail.com> wrote:
>> Well, even if Spark has to apply the filter again, with pushdown activated it 
>> costs Spark much less to check whether the filter has been applied. Applying 
>> the filter itself is negligible; what pushdown really avoids, if the file 
>> format implements it, is the I/O cost of reading as well as the cost of 
>> converting from the file format's internal datatypes to Spark's. Those two 
>> things are very expensive, but not the filter check. In the end, there can 
>> also be data-source-internal reasons not to apply a filter (there can be 
>> many, depending on your scenario, the format, etc.). Instead of “discussing” 
>> with the data source which filters were applied, it is much less costly for 
>> Spark to check that the filters are consistently applied.
>> 
>>> On 09.12.2018, at 12:39, Alessandro Solimando 
>>> <alessandro.solima...@gmail.com> wrote:
>>> 
>>> Hello,
>>> That's an interesting question, but after Jörn's reply I am a bit puzzled.
>>> 
>>> If there is no control over the pushdown status, how can Spark guarantee the 
>>> correctness of the final query?
>>> 
>>> Consider a filter pushed down to the data source: either Spark has to know 
>>> whether it has been applied, or it has to re-apply the filter anyway (and 
>>> pay the price for that).
>>> 
>>> Is there any other option I am not considering?
>>> 
>>> Best regards,
>>> Alessandro
>>> 
>>> On Sat, 8 Dec 2018 at 12:32, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily 
>>>> ending up in Spark (because it would cause unnecessary overhead). 
>>>> In the DataSource V2 API you need to implement SupportsPushDownFilters.
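>>>> A minimal sketch of that interface (assuming the Spark 2.4 DataSourceV2 API; 
>>>> the class name and the canEvaluateInSource helper are made up for 
>>>> illustration) could look like this:
>>>> 
>>>> import java.util.Collections
>>>> 
>>>> import org.apache.spark.sql.catalyst.InternalRow
>>>> import org.apache.spark.sql.sources.Filter
>>>> import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
>>>> import org.apache.spark.sql.types.StructType
>>>> 
>>>> // Hypothetical reader; only the pushdown-related parts are fleshed out.
>>>> class MyJsonLikeReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {
>>>> 
>>>>   private var pushed: Array[Filter] = Array.empty
>>>> 
>>>>   // Spark offers all candidate filters; the ones returned here are the ones
>>>>   // the source could NOT handle, and Spark keeps evaluating those itself.
>>>>   override def pushFilters(filters: Array[Filter]): Array[Filter] = {
>>>>     val (accepted, rejected) = filters.partition(canEvaluateInSource)
>>>>     pushed = accepted
>>>>     rejected
>>>>   }
>>>> 
>>>>   // What explain() later reports as PushedFilters.
>>>>   override def pushedFilters(): Array[Filter] = pushed
>>>> 
>>>>   override def readSchema(): StructType = schema
>>>> 
>>>>   // A real source would plan partitions that actually apply `pushed`.
>>>>   override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
>>>>     Collections.emptyList()
>>>> 
>>>>   // Hypothetical helper: decide which filters this source can apply itself.
>>>>   private def canEvaluateInSource(f: Filter): Boolean = false
>>>> }
>>>> 
>>>> Whatever pushFilters returns is evaluated again by Spark on top of the scan, 
>>>> so a source can always hand filters back if it cannot (or decides not to) 
>>>> apply them.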
>>>> 
>>>> > On 08.12.2018, at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
>>>> > 
>>>> > Hi,
>>>> > 
>>>> > I'm a support engineer, interested in DataSourceV2.
>>>> > 
>>>> > Recently I had some pain troubleshooting whether pushdown is actually 
>>>> > applied or not.
>>>> > I noticed that DataFrame's explain() method shows pushdown even for JSON.
>>>> > I believe this totally depends on the DataSource side. However, I would like 
>>>> > Spark to have some way to confirm whether a specific pushdown is actually 
>>>> > applied in the DataSource or not.
>>>> > 
>>>> > # Example
>>>> > val df = spark.read.json("s3://sample_bucket/people.json")
>>>> > df.printSchema()
>>>> > df.filter($"age" > 20).explain()
>>>> > 
>>>> > root
>>>> >  |-- age: long (nullable = true)
>>>> >  |-- name: string (nullable = true)
>>>> > 
>>>> > == Physical Plan ==
>>>> > *Project [age#47L, name#48]
>>>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>>>> >    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON, 
>>>> > Location: InMemoryFileIndex[s3://sample_bucket/people.json], 
>>>> > PartitionFilters: [], PushedFilters: [IsNotNull(age), 
>>>> > GreaterThan(age,20)], ReadSchema: struct<age:bigint,name:string>
>>>> > 
>>>> > # Comments
>>>> > As you can see, PushedFilters are shown even though the input data is JSON.
>>>> > Actually this pushdown is not used.
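>>>> > One rough, indirect check (purely illustrative, and only indicative, since 
>>>> > e.g. Parquet pushdown skips data at row-group rather than row granularity) 
>>>> > is to run the query and compare the scan operator's numOutputRows metric 
>>>> > with the total row count; if the source really applied the filter, the scan 
>>>> > should emit fewer rows than the unfiltered count:
>>>> > 
>>>> > import org.apache.spark.sql.execution.FileSourceScanExec
>>>> > 
>>>> > val filtered = df.filter($"age" > 20)
>>>> > filtered.collect()  // run the query so that the SQL metrics get populated
>>>> > 
>>>> > // rows emitted by the file scan itself, before Spark's own Filter operator
>>>> > val scanRows = filtered.queryExecution.executedPlan.collect {
>>>> >   case scan: FileSourceScanExec => scan.metrics("numOutputRows").value
>>>> > }.sum
>>>> > println(s"scan emitted $scanRows rows, dataset has ${df.count()} rows in total")
>>>> > 
>>>> > For JSON the scan count does not shrink, which matches the observation that 
>>>> > the pushed filters are not actually used.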
>>>> >    
>>>> > I'm wondering if this has already been discussed or not.
>>>> > If not, this is a chance to have such a feature in DataSourceV2, because it 
>>>> > would require some API-level changes.
>>>> > 
>>>> > 
>>>> > Warm regards,
>>>> > 
>>>> > Noritaka Sekiyama
>>>> > 
>>>> 
