I believe you're looking for df.na.fill in Scala; in the PySpark module it is fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html).
From the docs:

    df4.fillna({'age': 50, 'name': 'unknown'}).show()
    age  height  name
    10   80      Alice
    5    null    Bob
    50   null    Tom
    50   null    unknown

On Mon, May 18, 2015 at 11:01 PM, Chandra Mohan, Ananda Vel Murugan <ananda.muru...@honeywell.com> wrote:

> Hi,
>
> Thanks for the response, but I could not see a fillna function in the DataFrame class.
>
> Is it available in some specific version of Spark SQL? This is what I have in my pom.xml:
>
>     <dependency>
>         <groupId>org.apache.spark</groupId>
>         <artifactId>spark-sql_2.10</artifactId>
>         <version>1.3.1</version>
>     </dependency>
>
> Regards,
> Anand.C
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* Monday, May 18, 2015 5:19 PM
> *To:* Chandra Mohan, Ananda Vel Murugan; user
> *Subject:* Re: Spark sql error while writing Parquet file - Trying to write more fields than contained in row
>
> Hi,
>
> Give a try with the dataFrame.fillna function to fill up the missing column.
>
> Best,
> Ayan
>
> On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan <ananda.muru...@honeywell.com> wrote:
>
> Hi,
>
> I am using spark-sql to read a CSV file and write it as a Parquet file. I am building the schema using the following code:
>
>     String schemaString = "a b c";
>     List<StructField> fields = new ArrayList<StructField>();
>     MetadataBuilder mb = new MetadataBuilder();
>     mb.putBoolean("nullable", true);
>     Metadata m = mb.build();
>     for (String fieldName : schemaString.split(" ")) {
>         fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
>     }
>     StructType schema = DataTypes.createStructType(fields);
>
> Some of the rows in my input CSV do not contain three columns. After building my JavaRDD<Row>, I create a data frame as shown below using the RDD and schema.
>
>     DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>
> Finally, I try to save it as a Parquet file:
>
>     darDataFrame.saveAsParquetFile("/home/anand/output.parquet");
>
> I get this error when saving it:
>
>     java.lang.IndexOutOfBoundsException: Trying to write more fields than
>     contained in row (3 > 2)
>
> I understand the reason behind this error. Some of the rows in my Row RDD do not contain three elements, because some rows in my input CSV do not contain three columns. But while building the schema, I am specifying every field as nullable, so I believe it should not throw this error. Can anyone help me fix this error? Thank you.
>
> Regards,
> Anand.C

--
Best Regards,
Ayan Guha
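[Editor's note for readers of the archive: the exception above comes from Row objects that are shorter than the schema. Marking a field nullable only allows null *values*; each Row must still carry one entry per schema field. One way to avoid the mismatch, before the fillna step, is to pad each parsed CSV record out to the schema length with nulls. Below is a minimal plain-Python sketch of that padding idea (no Spark required; the `pad_row` helper name is mine, not from the thread):]

```python
def pad_row(values, width, fill=None):
    """Pad a parsed CSV record to `width` fields so its length matches the schema."""
    return list(values) + [fill] * (width - len(values))

# Example: CSV lines where some records have fewer than 3 columns,
# mirroring the 3-field "a b c" schema from the thread.
lines = ["1.0,2.0,3.0", "4.0,5.0", "6.0"]
rows = [pad_row(line.split(","), 3) for line in lines]
# Every record now has exactly 3 fields; the missing ones are None
# (which becomes null in Spark, permitted because the fields are nullable).
```

In the actual job, each padded list would then be turned into a Row before calling sqlContext.createDataFrame(rowRDD, schema), at which point fillna can replace the nulls.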