Hi Clark,
the problem is that null values in this dataset are represented by the NA
marker. Spark-csv doesn't have a configurable null value marker (I
made a PR for it some time ago:
https://github.com/databricks/spark-csv/pull/76).
So one option for you is to do post filtering, something like this:
val rv = allyears2k.filter("COLUMN != 'NA'")
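If you need this for every column rather than a single one, another way is to turn the NA strings into real nulls first, so that na.drop() has something to drop. A rough sketch, assuming Spark 1.4+ (for when/otherwise) and that all columns were loaded as strings, which as far as I know is what spark-csv does without schema inference; naMarker and withNulls are names made up for this example:

```scala
import org.apache.spark.sql.functions.{col, when}

// Hypothetical name: the marker string this dataset uses for missing values.
val naMarker = "NA"

// Rebuild every column, replacing the marker with a real null.
// Assumes string columns and column names without dots
// (col() would read a dotted name as a nested field).
val withNulls = allyears2k.select(
  allyears2k.columns.map { c =>
    when(col(c) === naMarker, null).otherwise(col(c)).as(c)
  }: _*
)

// na.drop() now removes every row that has a null in any column.
val rv = withNulls.na.drop()
```

If only some columns matter, na.drop(Seq("DepTime", "ArrTime")) restricts the check to those (the column names here are only examples).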
Thanks,
Peter Rudenko
On 2015-08-04 15:03, clark djilo kuissu wrote:
Hello,
I am trying to manage NA values in this dataset. I import my dataset with the
com.databricks.spark.csv package.
When I do this: allyears2k.na.drop() I get no result.
Can you help me, please?
Regards,
------------------- The dataset -------------------------
dataset: https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv
------------------- The code -------------------------
// Prepare environment
import sys.process._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val allyears2k = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  load("/home/clark/allyears2k.csv")
allyears2k.registerTempTable("allyears2k")
val rv = allyears2k.na.drop()