Thanks for the tip. Any idea why the intuitive answer (!= None) doesn't work? I inspected the Row columns and they do indeed have a None value. I suspect that Python's None is somehow translated to something in the JVM that doesn't compare equal to null?
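My working guess, which I have not confirmed against the Spark source, is that the comparison follows SQL's three-valued logic: comparing anything with null gives null rather than True or False, and a filter only keeps rows where the predicate is True. Here is a minimal plain-Python model of that behavior (no Spark needed). The helpers sql_ne, sql_is_not_null and sql_where are hypothetical names I made up for illustration; they are not part of PySpark.

```python
# Plain-Python model of SQL three-valued logic; no Spark required.
# sql_ne, sql_is_not_null and sql_where are hypothetical helpers for
# illustration only, not PySpark functions.

def sql_ne(a, b):
    """SQL-style `a != b`: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b


def sql_is_not_null(a):
    """SQL-style `a IS NOT NULL`: always True or False, never NULL."""
    return a is not None


def sql_where(rows, predicate):
    """Keep only rows whose predicate is exactly True; NULL and False are dropped."""
    return [row for row in rows if predicate(row) is True]


# Same data as in the original question.
rows = [(1, None), (2, "a"), (3, "b"), (4, None)]

# `_2 != None` evaluates to NULL for every row, so nothing passes the filter.
ne_none_count = len(sql_where(rows, lambda r: sql_ne(r[1], None)))

# `_2 IS NOT NULL` is True for the two rows holding "a" and "b".
not_null_count = len(sql_where(rows, lambda r: sql_is_not_null(r[1])))

print(ne_none_count)   # 0
print(not_null_count)  # 2
```

If this model is right, it would explain why the SQL string '_2 is not null' and the isNotNull column method both work while != None matches no rows.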
I might check out the source code for a better idea as well.

Pedro

On Wed, Jul 1, 2015 at 12:18 PM, Michael Armbrust <mich...@databricks.com> wrote:

> There is an isNotNull function on any column.
>
> df._1.isNotNull()
>
> or
>
> from pyspark.sql.functions import *
> col("myColumn").isNotNull()
>
> On Wed, Jul 1, 2015 at 3:07 AM, Olivier Girardot <ssab...@gmail.com> wrote:
>
>> I must admit I've been using the same "back to SQL" strategy for now :p
>> So I'd be glad to have insights into that too.
>>
>> On Tue, Jun 30, 2015 at 11:28 PM, pedro <ski.rodrig...@gmail.com> wrote:
>>
>>> I am trying to find the correct way to programmatically check for
>>> null values for rows in a dataframe. For example, below is the code
>>> using pyspark and sql:
>>>
>>> df = sqlContext.createDataFrame(sc.parallelize([(1, None), (2, "a"), (3, "b"), (4, None)]))
>>> df.where('_2 is not null').count()
>>>
>>> However, this won't work:
>>> df.where(df._2 != None).count()
>>>
>>> It seems there is no native Python way with DataFrames to do this, but I
>>> find that difficult to believe; more likely I am missing the "right way"
>>> to do this.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Check-for-null-in-PySpark-DataFrame-tp23553.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Pedro Rodriguez
UCBerkeley 2014 | Computer Science
SnowGeek <http://SnowGeek.org>
pedro-rodriguez.com
ski.rodrig...@gmail.com