Expressions/functions can be expensive, and I do think Spark should trust the data source and not re-apply pushed filters. If the data source lies, many things can go wrong...
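For what it's worth, the DataSourceV2 reader API already encodes this split: pushFilters() returns the filters the source cannot handle, and only those need to be evaluated by Spark after the scan. A minimal sketch against the Spark 2.4 API (MyReader and canHandle are made-up names for illustration):

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

class MyReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  // Spark calls this during planning. We keep the filters we can honor and
  // return the rest; Spark evaluates only the returned ones after the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition(canHandle)
    pushed = supported
    unsupported
  }

  // This is what explain() reports as PushedFilters.
  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType = schema

  // A real source would return actual partitions here.
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] =
    util.Collections.emptyList[InputPartition[InternalRow]]()

  // Source-specific capability check; made up for this sketch.
  private def canHandle(f: Filter): Boolean = false
}

So if a source keeps a filter in pushedFilters() but silently does not apply it, Spark has no way to tell; re-checking everything on the Spark side would negate much of the point of pushing down in the first place.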
On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke <jornfra...@gmail.com> wrote:

> Well, even if Spark has to apply the filter again, with pushdown activated
> it is much less costly for Spark to check whether the filter has been
> applied or not. Applying the filter itself is negligible; what pushdown
> really avoids, if the file format implements it, is the IO cost (for
> reading) as well as the cost of converting from the file format's internal
> datatypes to Spark's. Those two things are very expensive, but not the
> filter check. In the end, there can also be data-source-internal reasons
> not to apply a filter (there can be many, depending on your scenario, the
> format, etc.). Instead of "discussing" between Spark and the data source,
> it is much less costly for Spark to check that the filters are
> consistently applied.
>
> On 09.12.2018 at 12:39, Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
> Hello,
> that's an interesting question, but after Franke's reply I am a bit
> puzzled.
>
> If there is no control over the pushdown status, how can Spark guarantee
> the correctness of the final query?
>
> Consider a filter pushed down to the data source: either Spark has to know
> whether it has been applied or not, or it has to re-apply the filter
> anyway (and pay the price for that).
>
> Is there any other option I am not considering?
>
> Best regards,
> Alessandro
>
> On Sat, Dec 8, 2018 at 12:32, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily
>> ending up in Spark (because it would cause unnecessary overhead).
>> In the DataSource V2 API you need to implement SupportsPushDownFilters.
>>
>> On 08.12.2018 at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm a support engineer, interested in DataSourceV2.
>> >
>> > Recently I had some pain troubleshooting whether pushdown was actually
>> > applied or not.
>> > I noticed that DataFrame's explain() method shows pushdown even for
>> > JSON.
>> > It totally depends on the DataSource side, I believe. However, I would
>> > like Spark to have some way to confirm whether a specific pushdown is
>> > actually applied in the DataSource or not.
>> >
>> > # Example
>> > val df = spark.read.json("s3://sample_bucket/people.json")
>> > df.printSchema()
>> > df.filter($"age" > 20).explain()
>> >
>> > root
>> >  |-- age: long (nullable = true)
>> >  |-- name: string (nullable = true)
>> >
>> > == Physical Plan ==
>> > *Project [age#47L, name#48]
>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>> >    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
>> > Location: InMemoryFileIndex[s3://sample_bucket/people.json],
>> > PartitionFilters: [], PushedFilters: [IsNotNull(age),
>> > GreaterThan(age,20)], ReadSchema: struct<age:bigint,name:string>
>> >
>> > # Comments
>> > As you can see, PushedFilters are shown even though the input data is
>> > JSON. In reality this pushdown is not used.
>> >
>> > I'm wondering whether this has already been discussed.
>> > If not, this is a chance to add such a feature to DataSourceV2, because
>> > it would require some API-level changes.
>> >
>> > Warm regards,
>> >
>> > Noritaka Sekiyama
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org