Hi,

Give the dataFrame.fillna function a try to fill up the missing columns.
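A quick, untested sketch of what I mean, assuming you first pad the short rows
with nulls so every Row matches the three-field schema, and that 0.0 is an
acceptable default for the numeric columns (adjust as needed):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Pad each row with nulls up to the schema width (3 columns); the Parquet
// writer expects every Row to carry a value (or null) for each field.
JavaRDD<Row> paddedRDD = rowRDD.map(new Function<Row, Row>() {
    @Override
    public Row call(Row row) {
        Object[] values = new Object[3];
        for (int i = 0; i < row.size(); i++) {
            values[i] = row.get(i);
        }
        return RowFactory.create(values); // remaining slots stay null
    }
});

DataFrame df = sqlContext.createDataFrame(paddedRDD, schema);

// Replace the padded nulls in the numeric columns with a default value.
DataFrame filled = df.na().fill(0.0);

filled.saveAsParquetFile("/home/anand/output.parquet");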
Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan
<ananda.muru...@honeywell.com> wrote:
> Hi,
>
> I am using spark-sql to read a CSV file and write it out as a Parquet file.
> I am building the schema using the following code.
>
> String schemaString = "a b c";
> List<StructField> fields = new ArrayList<StructField>();
> MetadataBuilder mb = new MetadataBuilder();
> mb.putBoolean("nullable", true);
> Metadata m = mb.build();
> for (String fieldName : schemaString.split(" ")) {
>     fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
> }
> StructType schema = DataTypes.createStructType(fields);
>
> Some of the rows in my input CSV do not contain three columns. After
> building my JavaRDD<Row>, I create a data frame as shown below using the
> RDD and the schema.
>
> DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>
> Finally I try to save it as a Parquet file:
>
> darDataFrame.saveAsParquetFile("/home/anand/output.parquet")
>
> I get this error when saving it as a Parquet file:
>
> java.lang.IndexOutOfBoundsException: Trying to write more fields than
> contained in row (3 > 2)
>
> I understand the reason behind this error. Some of the rows in my Row RDD
> do not contain three elements, because some rows in my input CSV do not
> contain three columns. But while building the schema, I am specifying every
> field as nullable, so I believe it should not throw this error. Can anyone
> help me fix this error? Thank you.
>
> Regards,
>
> Anand.C

--
Best Regards,
Ayan Guha