Hi, I have a question related to Spark-Avro; not sure if this is the best
place to ask.
I have the following Scala case classes, populated with data in my Spark
application, which I try to save in Avro format to HDFS:

case class Claim(......)
case class Coupon(account_id: Long, ........, claims: List[Claim])

As shown above, the Coupon case class contains a List of Claim.
My RDD holds an iterator of Coupon data, which I then try to save to HDFS.
I am using Spark 1.3.1 with Spark-Avro 1.0.0 (which matches Spark 1.3.x):
rdd.toDF.save("hdfs_location", "com.databricks.spark.avro")
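For reference, here is a minimal, self-contained sketch of what I am doing
(the claim_id field and the sample data are made up for illustration; the
rest matches my setup):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical minimal versions of my case classes; claim_id stands in
// for the real Claim fields
case class Claim(claim_id: Long)
case class Coupon(account_id: Long, claims: List[Claim])

object AvroSaveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-save"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings in rdd.toDF for Spark 1.3

    // Tiny stand-in for the real RDD of Coupon records
    val rdd = sc.parallelize(Seq(Coupon(1L, List(Claim(100L)))))

    // Save as Avro through the spark-avro data source (Spark 1.3 save API)
    rdd.toDF().save("hdfs_location", "com.databricks.spark.avro")
  }
}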
Saving the data this way works fine, but the problem is that I cannot use
the resulting Avro data in Hive.
Here is the schema generated by Spark-Avro for the above data:
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "account_id",
      "type": "long"
    },
    ........
    {
      "name": "claims",
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "claims",
              "fields": [
                ......
The claims field is generated as a union containing an array, instead of an
array of structs directly. To make it clearer, here is the schema in Hive
when pointing a table at the data generated by Spark-Avro:

desc table
OK
col_name        data_type                                                       comment
account_id      bigint                                                          from deserializer
.......
claims          uniontype<array<uniontype<struct<account_id:bigint, .......>>>  from deserializer

Obviously, this causes trouble when querying the data in Hive (at least in
Hive 0.12, which we currently use), so end users cannot query it with
something like "select claims[0].account_id from table".
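If there is no way to control this, one workaround I am thinking about is to
flatten the nested list at the RDD level before converting to a DataFrame,
so that no array (and therefore no union) appears in the schema at all. This
is only a sketch; claim_id is a made-up field standing in for the real Claim
columns:

// Hypothetical flattening: emit one row per (coupon, claim) pair, so the
// generated Avro schema contains only flat primitive columns
val flat = rdd.flatMap { coupon =>
  coupon.claims.map(claim => (coupon.account_id, claim.claim_id))
}
flat.toDF("account_id", "claim_id")
  .save("hdfs_location_flat", "com.databricks.spark.avro")

That loses the nesting, though, so I would still prefer to keep the
structure and just get a plain array<struct> in the schema.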
I wonder why Spark-Avro has to wrap the array in a union in this case,
instead of just producing "array<struct>". Or better, is there a way I can
control the Avro schema that Spark-Avro generates in this case?

Thanks,
Yong