Re: Pushdown in DataSourceV2 question

2018-12-11 Thread Ryan Blue
In v2, it is up to the data source to tell Spark that a pushed filter is satisfied, by returning the pushed filters that Spark should run. You can indicate that a filter is handled by the source by not returning it to Spark. You can also show that a filter is used by the source by showing it in…
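Ryan's description matches the Spark 2.4-era SupportsPushDownFilters contract: pushFilters receives the candidate filters and returns the ones Spark must still evaluate itself, while pushedFilters reports what the source accepted. A minimal Scala sketch, with a made-up reader that only handles equality and null checks (the class name and the set of supported filter types are illustrative):

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownFilters}

// Hypothetical reader that can only evaluate EqualTo and IsNotNull itself.
abstract class ExampleReader extends DataSourceReader with SupportsPushDownFilters {

  private var accepted: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: EqualTo | _: IsNotNull => true
      case _                         => false
    }
    accepted = supported
    // Whatever is returned here, Spark evaluates again after the scan,
    // so correctness never hinges on the source doing the work.
    unsupported
  }

  // What the source claims to handle; this is what explain() surfaces.
  override def pushedFilters(): Array[Filter] = accepted

  // readSchema() and planInputPartitions() are left abstract in this sketch.
}
```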

Re: Pushdown in DataSourceV2 question

2018-12-11 Thread Noritaka Sekiyama
Hi, thank you for responding to this thread. I'm really interested in this discussion. My original idea might be the same as what Alessandro said: introducing a mechanism by which Spark can communicate with the DataSource and get metadata showing whether pushdown is supported or not. I'm wondering if it…

Re: Pushdown in DataSourceV2 question

2018-12-10 Thread Alessandro Solimando
I think you are generally right, but there are so many different scenarios that it might not always be the best option. Consider, for instance, a "fast" network between a single data source and Spark, lots of data, and an "expensive" (low-selectivity) expression, as Wenchen suggested. In such…

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Jörn Franke
It is not a question of lying or of trust. Some or all filters may not be supported by a data source. Some might only be applied under certain environmental conditions (e.g., enough memory). It is much more expensive to communicate between Spark and a data source which filters have been…

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Wenchen Fan
Expressions/functions can be expensive, and I do think Spark should trust the data source and not re-apply pushed filters. If the data source lies, many things can go wrong…

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Jörn Franke
Well, even if it has to apply it again, if pushdown is activated then it costs Spark very little to check whether the filter has been applied or not. Applying the filter is negligible; what it really avoids, if the file format implements it, is IO cost (for reading) as well as cost for…

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Alessandro Solimando
Hello, that's an interesting question, but after Jörn's reply I am a bit puzzled. If there is no control over the pushdown status, how can Spark guarantee the correctness of the final query? Consider a filter pushed down to the data source: either Spark has to know whether it has been applied or not,…

Re: Pushdown in DataSourceV2 question

2018-12-08 Thread Jörn Franke
BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily ending up in Spark (because it would cause unnecessary overhead). In the DataSource V2 API you need to implement SupportsPushDownFilters.
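For reference, the interface Jörn mentions is Spark 2.4's org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters, a Java interface. Rendered here as an equivalent Scala trait (the comments are mine, not from the Spark sources):

```scala
import org.apache.spark.sql.sources.Filter

// Scala rendering of Spark 2.4's Java interface SupportsPushDownFilters
// (in Spark it extends DataSourceReader; omitted here for brevity).
trait SupportsPushDownFilters {
  // Spark passes the filters it would like to push down; the return
  // value is the subset the source cannot handle, which Spark will
  // evaluate itself after the scan.
  def pushFilters(filters: Array[Filter]): Array[Filter]

  // The filters the source claims to handle; these are shown as
  // PushedFilters in the physical plan.
  def pushedFilters(): Array[Filter]
}
```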

Re: Pushdown in DataSourceV2 question

2018-12-08 Thread Jörn Franke
It was already available before DataSourceV2, but I think it might have been an internal/semi-official API (e.g., JSON has been an internal data source for some time now). The filters were provided to the data source, but you would never know whether the data source had indeed leveraged them or whether, for other…
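For context, the pre-v2 hook being described here is the V1 PrunedFilteredScan trait, where the filters passed to buildScan are only a hint; the separate BaseRelation.unhandledFilters method (added in Spark 1.6) is the feedback channel, and its default treats every filter as unhandled. A sketch, with a hypothetical relation that claims to handle equality filters:

```scala
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}

// Hypothetical V1 relation. The filters handed to buildScan are advisory;
// Spark cannot see from the scan itself whether they were actually used.
abstract class ExampleV1Relation extends BaseRelation with PrunedFilteredScan {

  // Since Spark 1.6, this is how a V1 source reports what it did NOT
  // handle. The default implementation returns all filters, i.e. Spark
  // re-applies everything unless the source says otherwise.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo]) // claim: equality is handled

  // schema and buildScan(requiredColumns, filters) are left abstract here.
}
```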

Pushdown in DataSourceV2 question

2018-12-08 Thread Noritaka Sekiyama
Hi, I'm a support engineer, interested in DataSourceV2. Recently I had some pain troubleshooting whether pushdown was actually applied or not. I noticed that DataFrame's explain() method shows pushed filters even for JSON; it totally depends on the DataSource side, I believe. However, I would like…
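A quick way to reproduce the symptom described above, in spark-shell (the input path and column name are made up):

```scala
// Run in spark-shell, where `spark` and the $-column syntax are in scope.
val df = spark.read.json("/tmp/events") // hypothetical JSON dataset
df.filter($"status" === "active").explain()

// The physical plan prints a scan node with something like (abbreviated):
//   PushedFilters: [IsNotNull(status), EqualTo(status,active)]
// even though the built-in JSON source does not actually evaluate those
// filters, which is why the plan alone cannot confirm pushdown took effect.
```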