Hi

Give the DataFrame fillna function a try to fill in the missing columns.
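
Something like this should work with the Java API (untested sketch; assuming
Spark 1.3.1+, where df.na() exposes the fill functions, and the columns a, b,
c from your schema):

    // Replace nulls in every numeric column with 0.0.
    // df.na() returns DataFrameNaFunctions, the Java/Scala counterpart
    // of Python's df.fillna().
    DataFrame filled = darDataFrame.na().fill(0.0);

    // Or restrict the fill to specific columns:
    DataFrame filledBC = darDataFrame.na().fill(0.0, new String[] {"b", "c"});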

Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan <
ananda.muru...@honeywell.com> wrote:

>  Hi,
>
> I am using spark-sql to read a CSV file and write it out as a Parquet
> file. I am building the schema with the following code.
>
>     // Three nullable double columns: a, b, c
>     String schemaString = "a b c";
>     List<StructField> fields = new ArrayList<StructField>();
>     MetadataBuilder mb = new MetadataBuilder();
>     mb.putBoolean("nullable", true);
>     Metadata m = mb.build();
>     for (String fieldName : schemaString.split(" ")) {
>         fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
>     }
>     StructType schema = DataTypes.createStructType(fields);
>
> Some of the rows in my input CSV do not contain all three columns. After
> building my JavaRDD<Row>, I create the data frame from the RDD and the
> schema as shown below.
>
> DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>
> Finally, I try to save it as a Parquet file:
>
> darDataFrame.saveAsParquetFile("/home/anand/output.parquet")
>
> I get this error when saving it as a Parquet file:
>
> java.lang.IndexOutOfBoundsException: Trying to write more fields than
> contained in row (3 > 2)
>
> I understand the reason behind this error. Some rows in my Row RDD do not
> contain three elements, because some rows in my input CSV do not contain
> all three columns. But while building the schema, I am specifying every
> field as nullable, so I believe it should not throw this error. Can anyone
> help me fix it? Thank you.
>
> Regards,
>
> Anand.C
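
One caveat with fillna: it only replaces nulls that are already present in
the DataFrame. Nullable only means a field may hold null; every Row must
still carry exactly as many values as the schema has fields, which is why
you see "Trying to write more fields than contained in row". So you would
need to pad the short rows with nulls while building the Row RDD. A rough
sketch (untested; assuming a hypothetical JavaRDD<String> called lines
holding your CSV input):

    JavaRDD<Row> rowRDD = lines.map(new Function<String, Row>() {
        @Override
        public Row call(String line) {
            String[] parts = line.split(",");
            // Pad to exactly three values, one per schema field.
            Object[] values = new Object[3];
            for (int i = 0; i < 3; i++) {
                values[i] = (i < parts.length && !parts[i].isEmpty())
                        ? Double.parseDouble(parts[i]) : null;
            }
            return RowFactory.create(values);
        }
    });

Every Row then matches the three-field schema, and the nulls can be filled
in afterwards with na().fill if you want concrete values.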



-- 
Best Regards,
Ayan Guha
