RE: Select all columns except some

2015-07-17 Thread Saif.A.Ellafi
Hello, thank you for your time.

Seq[String] works perfectly fine. I also tried running a for loop through all 
elements to see if any access to a value was broken, but no, they are alright.

For now, I solved it by calling the following. Sadly, it takes a lot of time, but 
it works:

var data_sas =
  sqlContext.read.format("com.github.saurfang.sas.spark").load("/path/to/file.s")
data_sas.cache
for (col <- clean_cols) {
  data_sas = data_sas.drop(col)
}
data_sas.unpersist
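
A single select would avoid the per-column drop calls; a minimal sketch, assuming 
clean_cols is the Seq[String] of columns to remove, as above:

import org.apache.spark.sql.Column

// Build the list of surviving columns once, then select them in one
// pass instead of calling drop once per unwanted column.
val keep = data_sas.columns.filterNot(clean_cols.contains).map(new Column(_))
val pruned = data_sas.select(keep: _*)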

Saif


From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Thursday, July 16, 2015 12:58 PM
To: Ellafi, Saif A.
Cc: user@spark.apache.org
Subject: Re: Select all columns except some

Have you tried to examine what clean_cols contains? I'm suspicious of this part: 
mkString(", ").
Try this:
val clean_cols : Seq[String] = df.columns...

If you get a type error, you need to work on clean_cols (I suspect yours is of 
type String at the moment and presents itself to Spark as a single column name 
with commas embedded).

Not sure why the .drop call hangs, but in either case drop returns a new 
DataFrame -- it's not a setter call.

On Thu, Jul 16, 2015 at 10:57 AM, 
saif.a.ell...@wellsfargo.com wrote:
Hi,

In a hundred-column dataframe, I wish to either select all of them except some, 
or drop the ones I don't want.

I am failing at such a simple task; I tried two ways:

val clean_cols = df.columns.filterNot(col_name =>
  col_name.startsWith("STATE_")).mkString(", ")
df.select(clean_cols)

But this throws an exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'asd_dt,
industry_area,...'
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)

The other thing I tried is

val cols = df.columns.filter(col_name => col_name.startsWith("STATE_"))
for (col <- cols) df.drop(col)

But this other approach either does nothing or hangs.

Saif






Re: Select all columns except some

2015-07-16 Thread Yana Kadiyska
Have you tried to examine what clean_cols contains? I'm suspicious of this
part: mkString(", ").
Try this:
val clean_cols : Seq[String] = df.columns...

If you get a type error, you need to work on clean_cols (I suspect yours is
of type String at the moment and presents itself to Spark as a single
column name with commas embedded).

Not sure why the .drop call hangs, but in either case drop returns a new
DataFrame -- it's not a setter call.
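
Concretely, a sketch of both fixes, assuming Spark 1.4+ (where DataFrame.drop(colName)
exists) and the STATE_ prefix from the original question:

// A real Seq[String] of the columns to keep, not one comma-joined String:
val clean_cols: Seq[String] = df.columns.filterNot(_.startsWith("STATE_")).toSeq

// select takes varargs, so splat the Seq instead of passing one String:
val selected = df.select(clean_cols.map(df.col): _*)

// drop returns a new DataFrame; capture the result instead of discarding it:
val state_cols = df.columns.filter(_.startsWith("STATE_"))
var pruned = df
for (c <- state_cols) pruned = pruned.drop(c)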

On Thu, Jul 16, 2015 at 10:57 AM, saif.a.ell...@wellsfargo.com wrote:

  Hi,

 In a hundred-column dataframe, I wish to either *select all of them
 except some*, or *drop the ones I don't want*.

 I am failing at such a simple task; I tried two ways:

 val clean_cols = df.columns.filterNot(col_name =>
   col_name.startsWith("STATE_")).mkString(", ")
 df.select(clean_cols)

 But this throws an exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'asd_dt,
 industry_area,...'
   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)

 The other thing I tried is

 val cols = df.columns.filter(col_name => col_name.startsWith("STATE_"))
 for (col <- cols) df.drop(col)

 But this other approach either does nothing or hangs.

 Saif






Re: Select all columns except some

2015-07-16 Thread Lars Albertsson
The snippet at the end worked for me. We run Spark 1.3.x, so
DataFrame.drop is not available to us.

As pointed out by Yana, DataFrame operations typically return a new
DataFrame, so use it like this:


import com.foo.sparkstuff.DataFrameOps._

...

val df = ...
val prunedDf = df.dropColumns("one_col", "other_col")







package com.foo.sparkstuff

import org.apache.spark.sql.{Column, DataFrame}

import scala.language.implicitConversions

class PimpedDataFrame(frame: DataFrame) {
  /**
   * Drop named columns from dataframe. Replace with DataFrame.drop
   * when upgrading to Spark 1.4.0.
   */
  def dropColumns(toDrop: String*): DataFrame = {
    val invalid = toDrop.filterNot(frame.columns.contains(_))
    if (invalid.nonEmpty) {
      throw new IllegalArgumentException("Columns not found: " + invalid.mkString(","))
    }
    val newColumns = frame.columns filter { c => !toDrop.contains(c) } map { new Column(_) }
    frame.select(newColumns: _*)
  }
}

object DataFrameOps {
  implicit def pimpDataFrame(df: DataFrame): PimpedDataFrame =
    new PimpedDataFrame(df)
}
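
On Spark 1.4+, where DataFrame.drop(colName) is built in but removes only one
column per call, the helper above can be replaced by folding over the names;
a minimal sketch (column names hypothetical):

// Thread the DataFrame through successive drop calls; each drop
// returns a new DataFrame rather than mutating the receiver.
val toDrop = Seq("one_col", "other_col")
val prunedDf = toDrop.foldLeft(df)((acc, c) => acc.drop(c))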



On Thu, Jul 16, 2015 at 4:57 PM,  saif.a.ell...@wellsfargo.com wrote:
 Hi,

 In a hundred-column dataframe, I wish to either select all of them except
 some, or drop the ones I don't want.

 I am failing at such a simple task; I tried two ways:

 val clean_cols = df.columns.filterNot(col_name =>
   col_name.startsWith("STATE_")).mkString(", ")
 df.select(clean_cols)

 But this throws an exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'asd_dt,
 industry_area,...'
   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)

 The other thing I tried is

 val cols = df.columns.filter(col_name => col_name.startsWith("STATE_"))
 for (col <- cols) df.drop(col)

 But this other approach either does nothing or hangs.

 Saif




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org