You can specify nullable in StructField; a nullable field then expects a plain value or null in the Row, rather than an Option.

On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
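In Spark 2.0 the row encoder validates external row values strictly against the schema: a StringType field must hold a plain String (or null when the StructField is nullable), so a Row containing scala.Some fails with exactly the error below. A minimal sketch of stripping the Option boxing before building Rows (the helper name and sample values are mine, not from the original code):

```scala
// Unwrap Option-boxed values into the plain value / null that Spark's
// row encoder expects for nullable StructFields.
def unbox(v: Any): Any = v match {
  case Some(x) => x      // Some("a") -> "a", Some(42L) -> 42L
  case None    => null   // None -> SQL NULL (field must be nullable = true)
  case other   => other  // already a plain value, pass through
}

val boxed: Seq[Any] = Seq(Some("host1"), Some(42L), None)
val plain = boxed.map(unbox)  // Seq("host1", 42L, null)
```

Applying such an unwrap to each field before `Row.fromSeq(...)` should satisfy the encoder, as long as every column that can be null is declared with `StructField(name, dataType, nullable = true)`.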
> Hello there,
>
> I am using Spark 2.0.0 to create a parquet file from a text file with
> Scala. I am reading a text file with a bunch of values, mostly of type
> string and long, and any of them can be null. In order to support nulls,
> all the values are boxed with Option (e.g. Option[String], Option[Long]).
> The schema for the parquet file is based on an external metadata file,
> so I use 'StructField' to create the schema programmatically and run a
> code snippet like the one below...
>
> sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
>   convertToRawColumns(line, schemaSeq)
> }
>
> ...
>
> val df = sqlContext.createDataFrame(rawFileRDD, schema) // Failed line.
>
> On a side note, the same code used to work fine with Spark 1.6.2.
>
> Here is the error from Spark 2.0.0:
>
> Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size
> Jul-28 20:27:10 [Executor task launch worker-483] ERROR Utils Aborting task
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of string
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true) AS host#37315
> +- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true)
>    :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
>    :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
>    :  :  +- input[0, org.apache.spark.sql.Row, true]
>    :  +- 0
>    :- null
>    +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true)
>       +- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType)
>          +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host)
>             +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
>                +- input[0, org.apache.spark.sql.Row, true]
>
> Let me know if you would like me to create a simplified reproducer for
> this problem. Perhaps I should not be using Option[T] for nullable
> schema values?
>
> Please advise,
> Muthu
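Since the original convertToRawColumns isn't shown, here is a hypothetical sketch of a metadata-driven line converter that emits plain values and nulls directly, never Option wrappers (ColMeta, toRawColumns, and the type-name strings are all my assumptions):

```scala
// Hypothetical column metadata, e.g. loaded from the external metadata file.
case class ColMeta(name: String, dataType: String, nullable: Boolean)

// Convert one split line into plain typed values: null for empty/missing
// tokens, never Some(...)/None, so Spark 2.0's encoder accepts the Row.
def toRawColumns(tokens: Array[String], schema: Seq[ColMeta]): Seq[Any] =
  schema.zipWithIndex.map { case (col, i) =>
    val tok = if (i < tokens.length) tokens(i).trim else ""
    if (tok.isEmpty) {
      require(col.nullable, s"empty value in non-nullable column ${col.name}")
      null                        // plain null, not None
    } else col.dataType match {
      case "long" => tok.toLong   // plain Long, not Some(long)
      case _      => tok          // treat anything else as a string
    }
  }

val meta = Seq(ColMeta("host", "string", nullable = true),
               ColMeta("bytes", "long", nullable = true))
toRawColumns("web01|1024".split('|'), meta)  // Seq("web01", 1024L)
toRawColumns("web01|".split('|'), meta)      // Seq("web01", null)
```

The resulting Seq[Any] can then be wrapped with Row.fromSeq and paired with a programmatically built StructType whose nullable columns use `StructField(name, dataType, nullable = true)`.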