[SQL] Simple DataFrame questions

2015-04-02 Thread Yana Kadiyska
Hi folks, having some seemingly noob issues with the dataframe API.

I have a DF which came from the csv package.

1. What would be an easy way to cast a column to a given type -- my DF
columns are all typed as strings coming from a CSV. I see a schema getter
but no setter on DF.

2. I am trying to use the syntax shown in various blog posts but can't
figure out how to reference a column by name:

scala> df.filter("customer_id"!="")
<console>:23: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
  df.filter("customer_id"!="")

3. What would be the recommended way to drop a row containing a null value
-- is it possible to do this:
scala> df.filter("customer_id IS NOT NULL")


Re: [SQL] Simple DataFrame questions

2015-04-02 Thread Yin Huai
For casting, you can use the selectExpr method. For example,
df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2").
Or, df.select(df("colA").cast("int"), ...)
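
For example, a minimal sketch (this assumes the spark-shell, where sqlContext
is predefined, and an all-string DF with made-up customer_id and name columns):

import sqlContext.implicits._   // for toDF on a local Seq

// all columns start out as strings, like a DF loaded from csv
val df = Seq(("42", "alice"), ("7", "bob")).toDF("customer_id", "name")

// SQL-expression style: cast inside selectExpr
val typed1 = df.selectExpr("cast(customer_id as int) as customer_id", "name")

// Column style: cast a Column object and keep the original name
val typed2 = df.select(df("customer_id").cast("int").as("customer_id"), df("name"))

typed1.printSchema()   // customer_id is now an int column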

On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust mich...@databricks.com
wrote:

 val df = Seq(("test", 1)).toDF("col1", "col2")

 You can use SQL style expressions as a string:

 df.filter("col1 IS NOT NULL").collect()
 res1: Array[org.apache.spark.sql.Row] = Array([test,1])

 Or you can also reference columns using df("colName") or $"colName" or
 col("colName")

 df.filter(df("col1") === "test").collect()
 res2: Array[org.apache.spark.sql.Row] = Array([test,1])
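
 For question 3, the same null filter can also be written against a Column --
 a quick sketch (the $"..." form assumes import sqlContext.implicits._, and
 col(...) comes from org.apache.spark.sql.functions):

 import sqlContext.implicits._              // enables $"colName"
 import org.apache.spark.sql.functions.col  // enables col("colName")

 df.filter(df("col1").isNotNull).collect()
 df.filter($"col1".isNotNull).collect()
 df.filter(col("col1").isNotNull).collect()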
