You can specify nullable in StructField.
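
As a minimal, untested sketch (the field names here are made up, and sc/sqlContext are assumed to be in scope as in your snippet): mark each field nullable = true, and note that createDataFrame expects the raw value or null inside each Row rather than Option[T], so unwrap the Options (e.g. with orNull) before building the Row -- that unwrapping is what avoids the "scala.Some is not a valid external type" error:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Each field is declared nullable = true, so the schema itself accepts nulls.
val schema = StructType(Seq(
  StructField("host", StringType, nullable = true),
  StructField("bytes", LongType, nullable = true)))

// Row must hold the raw value or null, not Option[T].
val raw: Seq[(Option[String], Option[Long])] = Seq(
  (Some("example.com"), Some(42L)),
  (None, None))

val rowRDD = sc.parallelize(raw).map { case (host, bytes) =>
  // orNull unwraps Option to the value or null; Long.box gives a
  // nullable java.lang.Long for the primitive Long case.
  Row(host.orNull, bytes.map(Long.box).orNull)
}

val df = sqlContext.createDataFrame(rowRDD, schema)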

On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:

> Hello there,
>
> I am using Spark 2.0.0 to create a parquet file from a text file with
> Scala. I am trying to read a text file with a bunch of values, mostly of
> type string and long, and any of them can be null. In order to support
> nulls, all the values are boxed with Option (e.g. Option[String],
> Option[Long]).
> The schema for the parquet file is based on an external metadata file,
> so I use 'StructField' to build the schema programmatically and run a
> code snippet like the one below...
>
> sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
>   convertToRawColumns(line, schemaSeq)
> }
>
> ...
>
> val df = sqlContext.createDataFrame(rawFileRDD, schema) //Failed line.
>
> On a side note, the same code used to work fine with Spark 1.6.2.
>
> Here is the error from Spark 2.0.0.
>
> Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Jul 28, 2016 8:27:10 PM INFO:
> org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size Jul-28
> 20:27:10 [Executor task launch worker-483] ERROR   Utils Aborting task
> java.lang.RuntimeException: Error while encoding:
> java.lang.RuntimeException: scala.Some is not a valid external type for
> schema of string
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
> object).isNullAt) null else staticinvoke(class
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host),
> StringType), true) AS host#37315
> +- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
> row object).isNullAt) null else staticinvoke(class
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host),
> StringType), true)
>    :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
> row object).isNullAt
>    :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
> row object)
>    :  :  +- input[0, org.apache.spark.sql.Row, true]
>    :  +- 0
>    :- null
>    +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String,
> StringType, fromString,
> validateexternaltype(getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host),
> StringType), true)
>       +- validateexternaltype(getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host),
> StringType)
>          +- getexternalrowfield(assertnotnull(input[0,
> org.apache.spark.sql.Row, true], top level row object), 0, host)
>             +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top
> level row object)
>                +- input[0, org.apache.spark.sql.Row, true]
>
>
> Let me know if you would like me to try to create a simpler reproducer
> for this problem. Perhaps I should not be using Option[T] for nullable
> schema values?
>
> Please advise,
> Muthu
>
