Definitely, thanks Mohammed. On Mon, Jul 20, 2015 at 5:47 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
> Thanks, Harish. > > > > Mike – this would be a cleaner version for your use case: > > df.filter(df("filter_field") === "value").select("field1").show() > > > > Mohammed > > > > *From:* Harish Butani [mailto:rhbutani.sp...@gmail.com] > *Sent:* Monday, July 20, 2015 5:37 PM > *To:* Mohammed Guller > *Cc:* Michael Armbrust; Mike Trienis; user@spark.apache.org > > *Subject:* Re: Data frames select and where clause dependency > > > > Yes via: org.apache.spark.sql.catalyst.optimizer.ColumnPruning > > See DefaultOptimizer.batches for list of logical rewrites. > > > > You can see the optimized plan by printing: df.queryExecution.optimizedPlan > > > > On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller <moham...@glassbeam.com> > wrote: > > Michael, > > How would the Catalyst optimizer optimize this version? > > df.filter(df("filter_field") === "value").select("field1").show() > > Would it still read all the columns in df or would it read only > “filter_field” and “field1” since only two columns are used (assuming other > columns from df are not used anywhere else)? > > > > Mohammed > > > > *From:* Michael Armbrust [mailto:mich...@databricks.com] > *Sent:* Friday, July 17, 2015 1:39 PM > *To:* Mike Trienis > *Cc:* user@spark.apache.org > *Subject:* Re: Data frames select and where clause dependency > > > > Each operation on a dataframe is completely independent and doesn't know > what operations happened before it. When you do a selection, you are > removing other columns from the dataframe and so the filter has nothing to > operate on. > > > > On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis <mike.trie...@orcsol.com> > wrote: > > I'd like to understand why the where field must exist in the select > clause. > > > > For example, the following select statement works fine > > - df.select("field1", "filter_field").filter(df("filter_field") === > "value").show() > > However, the next one fails with the error "in operator !Filter > (filter_field#60 = value);" > > - df.select("field1").filter(df("filter_field") === "value").show() > > As a work-around, it seems that I can do the following > > - df.select("field1", "filter_field").filter(df("filter_field") === > "value").drop("filter_field").show() > > > > Thanks, Mike. > > > > >