Hi Clark,
the problem is that null values in this dataset are represented by the NA marker, so they are loaded as ordinary strings rather than nulls and na.drop() has nothing to drop. Spark-csv doesn't have a configurable null-value marker (I made a PR for it some time ago: https://github.com/databricks/spark-csv/pull/76).

So one option for you is to do post filtering, something like this:

val rv = allyears2k.filter("COLUMN != 'NA'")
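
If you need this for every column rather than one, a rough sketch of another option is to turn the NA strings into real nulls first, so that na.drop() behaves as expected. This assumes all columns were loaded as strings (no schema inference) and that column names contain no dots or other special characters. The helper name naToNull and the column ArrDelay are only illustrative assumptions, not part of spark-csv:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, not, when}

// Hypothetical helper: replace the literal string marker with a real null
// in every column. when(...) without otherwise(...) yields null for rows
// that do not satisfy the condition, so "NA" cells become null.
def naToNull(df: DataFrame, marker: String = "NA"): DataFrame =
  df.select(df.columns.map(c => when(not(col(c) === marker), col(c)).as(c)): _*)

val cleaned = naToNull(allyears2k)

// Drop rows that have a null in any column.
val rvAll = cleaned.na.drop()

// Or only check selected columns (ArrDelay is an assumed column name).
val rvSome = cleaned.na.drop(Seq("ArrDelay"))
```

Note that if most rows have NA in at least one column, dropping on all columns can leave very few rows, so restricting the check to the columns you actually use is probably the safer choice.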

Thanks,
Peter Rudenko
On 2015-08-04 15:03, clark djilo kuissu wrote:
Hello,

I am trying to manage NA values in this dataset. I import the dataset with the com.databricks.spark.csv package.

When I run allyears2k.na.drop(), I get no result.

Can you help me, please?

Regards,

------------------- The dataset -------------------------

dataset: https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv

-------------------   The code -------------------------

// Prepare environment
import sys.process._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._


val allyears2k = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/clark/allyears2k.csv")
allyears2k.registerTempTable("allyears2k")

val rv = allyears2k.na.drop()

