Consider a data source that stores its data in 500 MB files and doesn't support predicate pushdown. Spark has to load each 500 MB file entirely into memory before it can apply filtering, column projection, etc. Now consider a data source that does support predicate pushdown, such as MySQL: Spark only needs to retrieve the rows and columns it actually uses, because the database provides an interface for asking for exactly that. In short, if the underlying data source supports predicate pushdown and the corresponding connector implements it, then filtering, projection, etc. are pushed down to the storage level. If not, the full dataset has to be loaded into memory, and filtering, projection, etc. happen there.
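To make the difference concrete, here is a minimal, self-contained sketch (plain Scala, not Spark itself) of the two behaviors described above; the row count, the `Row` shape, and the `age > 30` predicate are invented for illustration:

```scala
object PushdownSketch {
  final case class Row(age: Int, name: String)

  // Pretend this is the raw data file: the dumb source can only hand back whole rows.
  val source: Vector[Row] = (1 to 100).map(i => Row(i % 60, s"user$i")).toVector

  // No pushdown: every row crosses the source boundary, then the engine filters.
  def withoutPushdown(): (Int, Vector[Row]) = {
    var rowsTransferred = 0
    val loaded = source.map { r => rowsTransferred += 1; r } // full scan into memory
    (rowsTransferred, loaded.filter(_.age > 30))
  }

  // Pushdown: the predicate runs inside the source, so only matching rows cross.
  def withPushdown(): (Int, Vector[Row]) = {
    var rowsTransferred = 0
    val result = source.filter(_.age > 30).map { r => rowsTransferred += 1; r }
    (rowsTransferred, result)
  }
}
```

Both paths return the same rows; the difference is that without pushdown all 100 rows are transferred before filtering, while with pushdown only the matching rows ever leave the source.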
On Thu, Nov 17, 2016 at 7:50 AM +0000, "kant kodali" <kanth...@gmail.com> wrote:

Hi Assaf,

I am still trying to understand the merits of predicate pushdown from the examples you pointed out.

Example 1: Say we don't have predicate pushdown. Why does Spark need to pull all the rows and filter them in memory? Why not simply issue a select statement with a "where" clause to do the filtering via JDBC or something?

Example 2: Same argument as Example 1, except that without predicate pushdown we could simply do it using JOIN and WHERE operators in the SQL statement, right?

I feel like I am missing something about the merits of predicate pushdown.

Thanks,
kant

On Wed, Nov 16, 2016 at 11:34 PM, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:

Actually, both of your examples translate to the same plan. When you call sql("some query") or filter, the query isn't actually executed. Instead it is translated into a parsed plan, which turns everything into standard Spark expressions. Spark then analyzes it to fill in the blanks (what the users table is, for example) and attempts to optimize it. Predicate pushdown happens in that optimization step.

For example, say users is actually backed by a table in MySQL. Without predicate pushdown, Spark would first pull the entire users table from MySQL and only then do the filtering. With predicate pushdown, the filtering is done as part of the original SQL query sent to MySQL.

Another (probably better) example: take two tables A and B that are joined on some common key, followed by a filter on that key. Moving the filter to run before the join would probably make everything faster, since a filter is a cheaper operation than a join.

Assaf.
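Assaf's join example can be sketched outside Spark as well. In this toy comparison (the tables, key ranges, and the `> 90` predicate are all invented for illustration), a nested-loop join stands in for whatever join the engine would run:

```scala
object FilterBeforeJoin {
  val a: Vector[(Int, String)] = (1 to 100).map(k => (k, s"a$k")).toVector
  val b: Vector[(Int, String)] = (1 to 100).map(k => (k, s"b$k")).toVector
  def keep(k: Int): Boolean = k > 90 // the predicate on the join key

  // Plan 1: join everything, filter afterwards. The nested-loop join
  // compares every (a, b) pair: 100 * 100 = 10,000 comparisons.
  def filterAfterJoin: Vector[(Int, String, String)] =
    (for { (ka, va) <- a; (kb, vb) <- b if ka == kb } yield (ka, va, vb))
      .filter { case (k, _, _) => keep(k) }

  // Plan 2: push the filter below the join. Each side shrinks to 10 rows
  // first, so the join only does 10 * 10 = 100 comparisons.
  def filterBeforeJoin: Vector[(Int, String, String)] =
    for {
      (ka, va) <- a.filter(t => keep(t._1))
      (kb, vb) <- b.filter(t => keep(t._1))
      if ka == kb
    } yield (ka, va, vb)
}
```

Both plans produce identical results, which is exactly why the optimizer is free to rewrite one into the other.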
From: kant kodali [mailto:kanth...@gmail.com]
Sent: Thursday, November 17, 2016 8:03 AM
To: user @spark
Subject: How does predicate push down really help?

How does predicate push down really help in the following cases?

val df1 = spark.sql("select * from users where age > 30")

vs.

val df1 = spark.sql("select * from users")
df1.filter("age > 30")