I believe you're looking for df.na.fill in Scala; in the PySpark module it is fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html).
From the docs:

    df4.fillna({'age': 50, 'name': 'unknown'}).show()
    age  height  name
    10   80      Alice
    5    null    Bob
    50   null    Tom
    50   null    unknown

On Mon, May 18, 2015 at 11:01 PM, Chandra Mohan, Ananda Vel Murugan <ananda.muru...@honeywell.com> wrote:

> Hi,
>
> Thanks for the response, but I could not see a fillna function in the DataFrame class.
>
> Is it available in some specific version of Spark SQL? This is what I have in my pom.xml:
>
>     <dependency>
>         <groupId>org.apache.spark</groupId>
>         <artifactId>spark-sql_2.10</artifactId>
>         <version>1.3.1</version>
>     </dependency>
>
> Regards,
> Anand.C
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* Monday, May 18, 2015 5:19 PM
> *To:* Chandra Mohan, Ananda Vel Murugan; user
> *Subject:* Re: Spark sql error while writing Parquet file - Trying to write more fields than contained in row
>
> Hi,
>
> Give a try with the dataFrame.fillna function to fill up the missing column.
>
> Best,
> Ayan
>
> On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan <ananda.muru...@honeywell.com> wrote:
>
> Hi,
>
> I am using spark-sql to read a CSV file and write it as a Parquet file. I am building the schema using the following code:
>
>     String schemaString = "a b c";
>     List<StructField> fields = new ArrayList<StructField>();
>     MetadataBuilder mb = new MetadataBuilder();
>     mb.putBoolean("nullable", true);
>     Metadata m = mb.build();
>     for (String fieldName : schemaString.split(" ")) {
>         fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
>     }
>     StructType schema = DataTypes.createStructType(fields);
>
> Some of the rows in my input CSV do not contain three columns. After building my JavaRDD<Row>, I create a data frame as shown below using the RDD and schema.
>
>     DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>
> Finally, I try to save it as a Parquet file:
>
>     darDataFrame.saveAsParquetFile("/home/anand/output.parquet");
>
> I get this error when saving it:
>
>     java.lang.IndexOutOfBoundsException: Trying to write more fields than
>     contained in row (3 > 2)
>
> I understand the reason behind this error. Some of the rows in my Row RDD do not contain three elements, because some rows in my input CSV do not contain three columns. But while building the schema, I am specifying every field as nullable, so I believe it should not throw this error. Can anyone help me fix this error? Thank you.
>
> Regards,
> Anand.C

--
Best Regards,
Ayan Guha
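[Editor's note for readers of the archive: the exception above comes from Row objects that are shorter than the schema. Marking a field nullable only allows null *values*; each Row must still carry one entry per schema field. One way to avoid the mismatch, before the fillna step, is to pad each parsed CSV record out to the schema length with nulls. Below is a minimal plain-Python sketch of that padding idea (no Spark required; the `pad_row` helper name is mine, not from the thread):]

```python
def pad_row(values, width, fill=None):
    """Pad a parsed CSV record to `width` fields so its length matches the schema."""
    return list(values) + [fill] * (width - len(values))

# Example: CSV lines where some records have fewer than 3 columns,
# mirroring the 3-field "a b c" schema from the thread.
lines = ["1.0,2.0,3.0", "4.0,5.0", "6.0"]
rows = [pad_row(line.split(","), 3) for line in lines]
# Every record now has exactly 3 fields; the missing ones are None
# (which becomes null in Spark, permitted because the fields are nullable).
```

In the actual job, each padded list would then be turned into a Row before calling sqlContext.createDataFrame(rowRDD, schema), at which point fillna can replace the nulls.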