Filter pushdown was already available before DataSourceV2, but I think it was
only an internal/semi-official API (e.g., JSON has been an internal data source
for some time now). The filters were provided to the data source, but you could
never tell whether the data source had actually leveraged them or had decided,
for other reasons (e.g., because applying them would be inefficient in specific
cases), to ignore them.
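
For reference, here is a rough sketch of that pre-V2 contract. The relation
and its data are made up, but PrunedFilteredScan and buildScan are the real
V1 API:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical relation; PrunedFilteredScan is the real pre-V2 pushdown API.
class PeopleRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(
    StructField("age", LongType) :: StructField("name", StringType) :: Nil)

  // Spark hands the pushed-down filters to buildScan, but nothing in the
  // contract forces the relation to apply them: ignoring `filters` entirely
  // is legal, because Spark keeps its own Filter node on top of the scan
  // (unless the relation overrides BaseRelation.unhandledFilters).
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val data = Seq(Map("age" -> 25L, "name" -> "Andy"),
                   Map("age" -> 19L, "name" -> "Justin"))
    // Project to the requested columns; `filters` is never consulted here,
    // and Spark has no way to see that from the outside.
    sqlContext.sparkContext.parallelize(
      data.map(row => Row.fromSeq(requiredColumns.map(row).toSeq)))
  }
}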
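
For comparison, the DataSourceV2 reader API in 2.4 already has a hook where
the source reports back what it pushed. Again a sketch with a hypothetical
reader, but SupportsPushDownFilters and its two methods are the real interface:

import java.util.{List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{
  DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// Hypothetical reader; only the pushdown hooks matter for this sketch.
class ExampleReader(schema: StructType)
    extends DataSourceReader with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  // Called during planning: keep the filters this source can evaluate,
  // return the rest so Spark evaluates them itself after the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, rest) = filters.partition {
      case _: IsNotNull | _: GreaterThan => true  // whatever we can handle
      case _ => false
    }
    pushed = supported
    rest
  }

  // What explain() reports for a V2 scan: the source's own claim about
  // what it pushed, not a guarantee the filters were applied row by row.
  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType = schema
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    throw new UnsupportedOperationException("scan elided in this sketch")
}

Even with that hook, whether pushedFilters() reflects what really happens at
scan time is still up to the implementation, so confirming the pushdown was
applied would need something more than the plan string.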

> On 08.12.2018, at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
> 
> Hi,
> 
> I'm a support engineer, interested in DataSourceV2.
> 
> Recently I had some pain troubleshooting whether pushdown was actually
> applied or not.
> I noticed that DataFrame's explain() method shows pushed filters even for JSON.
> I believe this depends entirely on the DataSource side. However, I would like
> Spark to have some way to confirm whether a specific pushdown is actually
> applied in the DataSource or not.
> 
> # Example
> val df = spark.read.json("s3://sample_bucket/people.json")
> df.printSchema()
> df.filter($"age" > 20).explain()
> 
> root
>  |-- age: long (nullable = true)
>  |-- name: string (nullable = true)
> 
> == Physical Plan ==
> *Project [age#47L, name#48]
> +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON, 
> Location: InMemoryFileIndex[s3://sample_bucket/people.json], 
> PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)], 
> ReadSchema: struct<age:bigint,name:string>
> 
> # Comments
> As you can see, PushedFilters is shown even though the input data is JSON.
> Actually, this pushdown is not used.
>
> I'm wondering whether this has already been discussed.
> If not, this is a chance to add such a feature in DataSourceV2, because it
> would require some API-level changes.
> 
> 
> Warm regards,
> 
> Noritaka Sekiyama
> 
