Re: Question / issue while creating a Parquet file using a text file with Spark 2.0...

2016-07-28 Thread Muthu Jayakumar
Hello Dong Meng,

Thanks for the tip. But I already have code in place that looks like this...

StructField(columnName, getSparkDataType(dataType), nullable = true)

Maybe I am missing something else; the same code works fine with Spark
1.6.2, though. On a side note, I could be using SparkSession, but I don't
know how to split and map the rows elegantly when the schema is only known
at runtime, hence I am using the RDD API.
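
With a schema known at compile time, I can picture something like the
sketch below (flatFile and fileSeperator are from my snippet; the 'value'
column name and outputPath are placeholders), but my columns only come
from the metadata file at runtime:

import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder().appName("flat-to-parquet").getOrCreate()
import spark.implicits._

// Dataset encoders accept Option[String] / Option[Long] fields directly,
// unlike the RDD[Row] + schema route.
val ds = spark.read.textFile(flatFile).map { line =>
  val cols = line.split(fileSeperator)
  (Option(cols(0)).filter(_.nonEmpty), // Option[String]
   Try(cols(1).toLong).toOption)       // Option[Long]
}.toDF("host", "value")

ds.write.parquet(outputPath)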

Thanks,
Muthu


On Thu, Jul 28, 2016 at 10:47 PM, Dong Meng  wrote:

> you can specify nullable in StructField


Re: Question / issue while creating a Parquet file using a text file with Spark 2.0...

2016-07-28 Thread Dong Meng
you can specify nullable in StructField
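
For example, something like this minimal sketch (the column names are
made up):

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

val schema = StructType(Seq(
  StructField("host", StringType, nullable = true), // null values allowed
  StructField("value", LongType, nullable = true)
))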



Question / issue while creating a Parquet file using a text file with Spark 2.0...

2016-07-28 Thread Muthu Jayakumar
Hello there,

I am using Spark 2.0.0 with Scala to create a Parquet file from a text
file. The file holds values that are mostly of type string and long, and
any of them can be null. To support nulls, all the values are boxed with
Option (e.g., Option[String], Option[Long]).
The schema for the Parquet file is based on an external metadata file, so
I use 'StructField' to build the schema programmatically and run a code
snippet like the one below...

val rawFileRDD = sc.textFile(flatFile).map(_.split(fileSeperator)).map { line =>
  convertToRawColumns(line, schemaSeq) // builds a Row for this line
}
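
For context, convertToRawColumns does roughly the following. This is a
simplified sketch; the key point is that every value ends up wrapped in
Option before going into the Row:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructField

// Simplified: wrap each raw column in Option (None for empty fields) and
// build a Row; the full version also uses schemaSeq for type conversion.
def convertToRawColumns(cols: Array[String], schemaSeq: Seq[StructField]): Row =
  Row.fromSeq(cols.map(c => if (c.isEmpty) None else Some(c)))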

...

val df = sqlContext.createDataFrame(rawFileRDD, schema) // this is the failing line

On a side note, the same code used to work fine with Spark 1.6.2.

Here is the error from Spark 2.0.0.

Jul 28, 2016 8:27:10 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
Compression: SNAPPY
Jul 28, 2016 8:27:10 PM INFO:
org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size
Jul-28 20:27:10 [Executor task launch worker-483] ERROR Utils Aborting task
java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: scala.Some is not a valid external type for
schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
object).isNullAt) null else staticinvoke(class
org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true) AS host#37315
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
row object).isNullAt) null else staticinvoke(class
org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true)
   :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row
object).isNullAt
   :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level
row object)
   :  :  +- input[0, org.apache.spark.sql.Row, true]
   :  +- 0
   :- null
   +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String,
StringType, fromString,
validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType), true)
  +- validateexternaltype(getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host),
StringType)
 +- getexternalrowfield(assertnotnull(input[0,
org.apache.spark.sql.Row, true], top level row object), 0, host)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top
level row object)
   +- input[0, org.apache.spark.sql.Row, true]


Let me know if you would like me to put together a simpler reproducer for
this problem. Perhaps I should not be using Option[T] for nullable schema
values?
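
For instance, I could unwrap the Options to plain nulls before building
each Row. A sketch of what I have in mind (unwrap is a made-up helper):

import org.apache.spark.sql.Row

// Some(x) -> x, None -> null; once typed as Any, a Long inside Some is
// already boxed to java.lang.Long, which the schema's external-type
// check accepts.
def unwrap(v: Any): Any = v match {
  case Some(x) => x
  case None    => null
  case other   => other
}

val fixedRDD = rawFileRDD.map(r => Row.fromSeq(r.toSeq.map(unwrap)))
val df = sqlContext.createDataFrame(fixedRDD, schema)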

Please advise,
Muthu