Re: Question / issue while creating a parquet file using a text file with spark 2.0...
Hello Dong Meng,

Thanks for the tip. But I do have code in place that looks like this...

  StructField(columnName, getSparkDataType(dataType), nullable = true)

Maybe I am missing something else. The same code works fine with Spark 1.6.2, though. On a side note, I could be using SparkSession, but I don't know how to split and map the row elegantly. Hence I am using it as an RDD.

Thanks,
Muthu

On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng wrote:
> you can specify nullable in StructField
>
> [quoted original message and stack trace trimmed; see the original message at the bottom of the thread]
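A hedged sketch of the SparkSession route mentioned above: in Spark 2.0, case classes with Option[T] fields work directly with Dataset encoders, so the split-and-map can stay simple and no manual Row construction (or Option unwrapping) is needed. The names flatFile and fileSeperator follow the thread; the two-column layout (host, count), the paths, and the separator are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Option[T] fields map to nullable columns via the built-in product encoders.
case class RawRecord(host: Option[String], count: Option[Long])

val spark = SparkSession.builder().appName("option-fields-sketch").getOrCreate()
import spark.implicits._ // brings the case-class encoders into scope

val flatFile = "/path/to/flat.txt" // placeholder path
val fileSeperator = "\t"           // placeholder separator

val ds = spark.sparkContext.textFile(flatFile)
  .map(_.split(fileSeperator, -1)) // -1 keeps trailing empty fields
  .map { cols =>
    RawRecord(
      Option(cols(0)).filter(_.nonEmpty), // empty string -> None
      Try(cols(1).toLong).toOption        // unparsable -> None
    )
  }
  .toDS() // encoder handles the Option fields; no createDataFrame needed

ds.write.parquet("/path/to/out.parquet") // placeholder output path
```

This sidesteps the Row-based path entirely, since the encoder knows how to turn None into a null cell.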
Re: Question / issue while creating a parquet file using a text file with spark 2.0...
you can specify nullable in StructField

On Thu, Jul 28, 2016 at 9:14 PM, Muthu Jayakumar wrote:
> [quoted original message and stack trace trimmed; see the original message at the bottom of the thread]
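A minimal sketch of the tip above: building the schema programmatically from metadata with nullable StructFields. The field names, the metadata pairs, and the getSparkDataType mapping are illustrative stand-ins, not taken from the thread's actual metadata file.

```scala
import org.apache.spark.sql.types._

// Hypothetical mapping from a metadata type name to a Spark DataType.
def getSparkDataType(dataType: String): DataType = dataType match {
  case "string" => StringType
  case "long"   => LongType
  case other    => throw new IllegalArgumentException(s"Unsupported type: $other")
}

// Stand-in for entries read from the external metadata file.
val metadata = Seq(("host", "string"), ("count", "long"))

// nullable = true marks every column as allowed to hold null.
val schema = StructType(metadata.map { case (name, tpe) =>
  StructField(name, getSparkDataType(tpe), nullable = true)
})
```

Note that nullable = true only tells Spark the column may contain null; it does not make the encoder accept Option values inside a Row.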
Question / issue while creating a parquet file using a text file with spark 2.0...
Hello there,

I am using Spark 2.0.0 to create a parquet file using a text file with Scala. I am trying to read a text file with a bunch of values, mostly of type string and long, and any of them can be null. In order to support nulls, all the values are boxed with Option (ex:- Option[String], Option[Long]).

The schema for the parquet file is based on an external metadata file, so I use 'StructField' to create the schema programmatically and run a code snippet like the one below...

  sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
    convertToRawColumns(line, schemaSeq)
  }

  ...

  val df = sqlContext.createDataFrame(rawFileRDD, schema) //Failed line.

On a side note, the same code used to work fine with Spark 1.6.2.

Here is the error from Spark 2.0.0.

  Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
  Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size
  Jul-28 20:27:10 [Executor task launch worker-483] ERROR Utils Aborting task
  java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of string
  if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true) AS host#37315
  +- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true)
     :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
     :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
     :  :  +- input[0, org.apache.spark.sql.Row, true]
     :  +- 0
     :- null
     +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType), true)
        +- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host), StringType)
           +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, host)
              +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
                 +- input[0, org.apache.spark.sql.Row, true]

Let me know if you would like me to try to create a more simplified reproducer for this problem. Perhaps I should not be using Option[T] for nullable schema values?

Please advise,
Muthu
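One way to read the error above: the Rows handed to createDataFrame carry scala.Some(...) values, while the Spark 2.0 row encoder expects plain String/Long (or null) as the external value for a nullable column. A hedged sketch of a convertToRawColumns that unwraps each Option with orNull before building the Row — the signature and the per-type handling below are assumptions, since the real helper is not shown in the thread:

```scala
import org.apache.spark.sql.Row
import scala.util.Try

// Hypothetical helper: schemaSeq is assumed to be the per-column type names
// from the metadata file, aligned positionally with the split line.
def convertToRawColumns(line: Array[String], schemaSeq: Seq[String]): Row = {
  val values: Seq[Any] = line.zip(schemaSeq).map {
    // Unwrap before the Row: None/unparsable -> null, never Some(...).
    case (raw, "long") => Try(raw.toLong).toOption.map(Long.box).orNull
    case (raw, _)      => Option(raw).filter(_.nonEmpty).orNull
  }
  Row.fromSeq(values)
}
```

With the Option values unwrapped this way, the Row contains only externally valid types (String, java.lang.Long, null), which is what the encoder behind createDataFrame validates against.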