Michael, How would the Catalyst optimizer optimize this version? df.filter(df("filter_field") === "value").select("field1").show() Would it still read all the columns in df or would it read only “filter_field” and “field1” since only two columns are used (assuming other columns from df are not used anywhere else)?
Mohammed From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, July 17, 2015 1:39 PM To: Mike Trienis Cc: user@spark.apache.org Subject: Re: Data frames select and where clause dependency Each operation on a dataframe is completely independent and doesn't know what operations happened before it. When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on. On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis <mike.trie...@orcsol.com<mailto:mike.trie...@orcsol.com>> wrote: I'd like to understand why the where field must exist in the select clause. For example, the following select statement works fine * df.select("field1", "filter_field").filter(df("filter_field") === "value").show() However, the next one fails with the error "in operator !Filter (filter_field#60 = value);" * df.select("field1").filter(df("filter_field") === "value").show() As a work-around, it seems that I can do the following * df.select("field1", "filter_field").filter(df("filter_field") === "value").drop("filter_field").show() Thanks, Mike.