RE: Select all columns except some
Hello, thank you for your time. Seq[String] works perfectly fine. I also ran a for loop over all the elements to check whether access to any value was broken, but no, they are all fine. For now, I solved it by calling the following. Sadly, it takes a lot of time, but it works:

    var data_sas = sqlContext.read.format("com.github.saurfang.sas.spark").load("/path/to/file.s")
    data_sas.cache()
    for (col <- clean_cols) {
      data_sas = data_sas.drop(col)
    }
    data_sas.unpersist()

Saif
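For comparison, the loop above builds a new plan for every dropped column. The same cleanup can be done in one pass by selecting the surviving columns instead; a minimal sketch, untested against this data, assuming clean_cols holds the names to drop as in the loop above:

    // keep every column whose name is not in clean_cols (the names to drop)
    val keep = data_sas.columns.filterNot(clean_cols.contains(_))
    // one select instead of N successive drops
    val data_clean = data_sas.select(keep.map(data_sas.col): _*)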
Re: Select all columns except some
Have you tried to examine what clean_cols contains? I'm suspect of this part: mkString(", "). Try this:

    val clean_cols: Seq[String] = df.columns...

If you get a type error, you need to work on clean_cols (I suspect yours is of type String at the moment and presents itself to Spark as a single column name with commas embedded). Not sure why the .drop call hangs, but in either case drop returns a new dataframe -- it's not a setter call.

On Thu, Jul 16, 2015 at 10:57 AM, saif.a.ell...@wellsfargo.com wrote:

Hi,

In a hundred-column dataframe, I wish to either *select all of them except* or *drop the ones I don't want*. I am failing at this simple task; I have tried two ways.

    val clean_cols = df.columns.filterNot(col_name => col_name.startsWith("STATE_")).mkString(", ")
    df.select(clean_cols)

But this throws an exception:

    org.apache.spark.sql.AnalysisException: cannot resolve 'asd_dt, industry_area,...'
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)

The other thing I tried is:

    val cols = df.columns.filter(col_name => col_name.startsWith("STATE_"))
    for (col <- cols) df.drop(col)

But this doesn't do anything, or hangs up.

Saif
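Spelled out, the suggestion amounts to keeping clean_cols as a real Seq[String] and splatting it into select as varargs, instead of joining the names into one comma-separated string; a minimal sketch, reusing the names from the post above:

    // filterNot keeps everything that does NOT start with "STATE_"
    val clean_cols: Seq[String] = df.columns.filterNot(_.startsWith("STATE_")).toSeq
    // select(col: String, cols: String*) takes varargs; splat head and tail
    // (assumes at least one column survives the filter)
    val cleaned = df.select(clean_cols.head, clean_cols.tail: _*)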
Re: Select all columns except some
The snippet at the end worked for me. We run Spark 1.3.x, so DataFrame.drop is not available to us. As pointed out by Yana, DataFrame operations typically return a new DataFrame, so use it accordingly:

    import com.foo.sparkstuff.DataFrameOps._
    ...
    val df = ...
    val prunedDf = df.dropColumns("one_col", "other_col")

The helper itself:

    package com.foo.sparkstuff

    import org.apache.spark.sql.{Column, DataFrame}
    import scala.language.implicitConversions

    class PimpedDataFrame(frame: DataFrame) {
      /**
       * Drop named columns from the dataframe. Replace with DataFrame.drop
       * when upgrading to Spark 1.4.0.
       */
      def dropColumns(toDrop: String*): DataFrame = {
        val invalid = toDrop.filterNot(frame.columns.contains(_))
        if (invalid.nonEmpty) {
          throw new IllegalArgumentException("Columns not found: " + invalid.mkString(","))
        }
        val newColumns = frame.columns.filter { c => !toDrop.contains(c) }.map(new Column(_))
        frame.select(newColumns: _*)
      }
    }

    object DataFrameOps {
      implicit def pimpDataFrame(df: DataFrame): PimpedDataFrame = new PimpedDataFrame(df)
    }
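On Spark 1.4+, where DataFrame.drop(columnName) is available, the same helper collapses to a fold over the names to remove; a minimal sketch, not from the thread:

    // each drop returns a new DataFrame, so thread the result through a foldLeft
    val pruned = Seq("one_col", "other_col").foldLeft(df)((d, c) => d.drop(c))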