Hi, Lian:
Thanks for the information. It works as expected in Spark with this setting.
Yong

Subject: Re: Is this a Spark issue or Hive issue that Spark cannot read the 
string type data in the Parquet generated by Hive
To: java8...@hotmail.com; user@spark.apache.org
From: lian.cs....@gmail.com
Date: Fri, 25 Sep 2015 14:42:55 -0700

Please set the SQL option spark.sql.parquet.binaryAsString to true
when reading Parquet files containing strings generated by Hive.
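
In the Spark shell that could look something like this (a minimal
sketch; the path is the one from your mail below):

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    // Interpret un-annotated Parquet binary columns as strings
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
    v_event_cnt.printSchema   // column2 should now be reported as string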

    

This is actually a bug in parquet-hive. When generating the Parquet
schema for a string field, Parquet requires a "UTF8" annotation,
something like:

    message hive_schema {
      ...
      optional binary column2 (UTF8);
      ...
    }

but parquet-hive fails to add it, and produces:

    message hive_schema {
      ...
      optional binary column2;
      ...
    }

Thus binary fields and string fields are made indistinguishable.
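
One way to check which flavor a given file has is to dump the footer
schema with the Parquet Java API (a rough sketch against the parquet
1.3.x API; the part file name is a placeholder):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import parquet.hadoop.ParquetFileReader

    // Print the raw schema stored in the file footer; a correctly
    // written string column should show up as
    // "optional binary column2 (UTF8)"
    val footer = ParquetFileReader.readFooter(
      new Configuration(), new Path("/hdfs_location/dt=2015-09-23/part-00000"))
    println(footer.getFileMetaData.getSchema)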

      

Interestingly, there's another bug in parquet-thrift, which always
adds the UTF8 annotation to all binary fields :)

Cheng

On 9/25/15 2:03 PM, java8964 wrote:
Hi, Spark Users:

I have a problem where Spark cannot recognize the string type in the
Parquet schema generated by Hive.

Versions of all components:

Spark 1.3.1
Hive 0.12.0
Parquet 1.3.2

        
I generated a detailed low-level table in the Parquet format using
MapReduce Java code. This table can be read in both Hive and Spark
without any issue.

        
Now I create a Hive aggregation table like the following:

        
    create external table T (
        column1 bigint,
        column2 string,
        ..............
    )
    partitioned by (dt string)
    ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
      OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
    location '/hdfs_location'

        
Then the table is populated in Hive by:

        
        
    set hive.exec.compress.output=true;
    set parquet.compression=snappy;

    insert into table T partition(dt='2015-09-23')
    select
        .............
    from Detail_Table
    group by

        
After this, we can query table T in Hive without issue.

        
But if I try to use it in Spark 1.3.1 like the following:

        
    import org.apache.spark.sql.SQLContext
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")

        
        
    scala> v_event_cnt.printSchema
    root
     |-- column1: long (nullable = true)
     |-- column2: binary (nullable = true)
     |-- ............
     |-- dt: string (nullable = true)

        
Spark recognizes column2 as binary type instead of string type in
this case, but in Hive it works fine.
This brings up an issue: in Spark the data will be dumped as
"[B@e353d68". To use it in Spark, I have to cast it to string to get
the correct value out of it.
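
For example, something like the following (a rough sketch, using the
column names from the schema above):

    // Cast the binary column back to a string before using it
    val v_event_str = v_event_cnt.selectExpr(
      "column1", "cast(column2 as string) as column2", "dt")
    v_event_str.show()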
        

        
I wonder which part causes this type mismatch in the Parquet file. Is
Hive not generating the correct Parquet schema, or can Spark in fact
not recognize it due to a problem on its side?

          
Is there a way, in either Hive or Spark, to make this Parquet schema
work correctly on both ends?

          
Thanks

Yong
