Re: Data frames select and where clause dependency

2015-07-20 Thread Mike Trienis
ow()
>
> Mohammed
>
> *From:* Harish Butani [mailto:rhbutani.sp...@gmail.com]
> *Sent:* Monday, July 20, 2015 5:37 PM
> *To:* Mohammed Guller
> *Cc:* Michael Armbrust; Mike Trienis; user@spark.apache.org
> *Subject:* Re: Data frames select and where clause

RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Michael Armbrust; Mike Trienis; user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Yes via: org.apache.spark.sql.catalyst.optimizer.ColumnPruning
See DefaultOptimizer.batches for list of logical rewrites.
You can see the optimized plan by printing: df.queryExecution
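As a rough illustration of inspecting the plans mentioned above, the sketch below runs the working query from the original question and prints its plans. Several things here are assumptions rather than anything from the thread: it targets Spark 2.x or later (`SparkSession` did not exist in the release current in July 2015), it uses a local master, the sample rows are made up, and the third column `unused` is a hypothetical column added only so there is something for the optimizer to prune. `explain(true)` prints the parsed, analyzed, and optimized logical plans plus the physical plan; the exact plan text varies by Spark version and data source.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainPlans {
  // Builds the query from the thread on made-up data.
  // "unused" is a hypothetical extra column that the query never references.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(("a", "value", 1), ("b", "other", 2))
      .toDF("field1", "filter_field", "unused")
    df.select("field1", "filter_field")
      .filter(df("filter_field") === "value")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, Spark 2.x+ API.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explain-plans")
      .getOrCreate()
    try {
      // Prints the logical plans (including the optimized one) and the physical plan.
      build(spark).explain(true)
    } finally {
      spark.stop()
    }
  }
}
```

Comparing the analyzed and optimized plans in the output is one way to check whether columns that are never referenced are still being carried through the query.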

Re: Data frames select and where clause dependency

2015-07-20 Thread Harish Butani
ing other
> columns from df are not used anywhere else)?
>
> Mohammed
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Friday, July 17, 2015 1:39 PM
> *To:* Mike Trienis
> *Cc:* user@spark.apache.org
> *Subject:* Re: Data frames select an

RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
other columns from df are not used anywhere else)?

Mohammed

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Each operation on a dataframe is completel

Re: Data frames select and where clause dependency

2015-07-17 Thread Michael Armbrust
Each operation on a dataframe is completely independent and doesn't know what operations happened before it. When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis wrote:
> I'd like t
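Following the explanation in this reply, one way to keep only `field1` in the result while still filtering on `filter_field` is to apply the filter first, while the column is still present, and select afterwards. The sketch below is illustrative only: the sample rows are invented, and it assumes Spark 2.x or later with a local master (the `SparkSession` entry point is newer than the release discussed in this thread).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FilterThenSelect {
  // Filter while "filter_field" is still a column, then narrow to "field1".
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Made-up sample rows using the column names from the question.
    val df = Seq(("a", "value"), ("b", "other"))
      .toDF("field1", "filter_field")
    df.filter(df("filter_field") === "value")
      .select("field1")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, Spark 2.x+ API.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-then-select")
      .getOrCreate()
    try {
      build(spark).show()
    } finally {
      spark.stop()
    }
  }
}
```

With this ordering the filter only refers to a column that its input still contains, so the result does not depend on how a particular Spark version resolves columns dropped by an earlier `select`.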

Data frames select and where clause dependency

2015-07-17 Thread Mike Trienis
I'd like to understand why the where field must exist in the select clause. For example, the following select statement works fine -

df.select("field1", "filter_field").filter(df("filter_field") === "value").show()

However, the next one fails with the error "in operator !Filter (filter_fie