RE: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-28 Thread java8964
Hi, Lian:
Thanks for the information. It works as expected in Spark with this setting.
Yong

Subject: Re: Is this a Spark issue or Hive issue that Spark cannot read the 
string type data in the Parquet generated by Hive
To: java8...@hotmail.com; user@spark.apache.org
From: lian.cs@gmail.com
Date: Fri, 25 Sep 2015 14:42:55 -0700

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
Please set the SQL option spark.sql.parquet.binaryAsString to true 
when reading Parquet files containing strings generated by Hive.
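
For example (a minimal sketch for the 1.3 shell, reusing the path from 
the snippet quoted below), the option can be set on the HiveContext 
before re-reading the files:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Treat Parquet binary columns that lack a UTF8 annotation as strings
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Re-read the Hive-generated Parquet data; column2 should now be string
val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
v_event_cnt.printSchema()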


This is actually a bug in parquet-hive. When generating the Parquet 
schema for a string field, Parquet requires a "UTF8" annotation, something like:


message hive_schema {
  ...
  optional binary column2 (UTF8);
  ...
}

but parquet-hive fails to add it, and produces:

message hive_schema {
  ...
  optional binary column2;
  ...
}

Thus binary fields and string fields are made indistinguishable.

Interestingly, there's another bug in parquet-thrift, which always adds 
the UTF8 annotation to all binary fields :)


Cheng

On 9/25/15 2:03 PM, java8964 wrote:

Hi, Spark Users:

I have a problem where Spark cannot recognize the string type in 
the Parquet schema generated by Hive.


Versions of all components:

Spark 1.3.1
Hive 0.12.0
Parquet 1.3.2

I generated a detailed low-level table in the Parquet format using 
MapReduce Java code. This table can be read in both Hive and Spark 
without any issue.


Now I create a Hive aggregation table like the following:

create external table T (
column1 bigint,
column2 string,
..
)
partitioned by (dt string)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
location '/hdfs_location'

Then the table is populated in Hive by:

set hive.exec.compress.output=true;
set parquet.compression=snappy;

insert into table T partition(dt='2015-09-23')
select
.
from Detail_Table
group by

After this, we can query table T in Hive without issue.

But if I try to use it in Spark 1.3.1 like the following:

import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val v_event_cnt=sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")

scala> v_event_cnt.printSchema
root
 |-- column1: long (nullable = true)
 |-- column2: binary (nullable = true)
 |-- 
 |-- dt: string (nullable = true)

Spark recognizes column2 as binary type instead of string type in this 
case, but in Hive it works fine.
This brings up an issue: in Spark, the data is dumped as "[B@e353d68". 
To use it in Spark, I have to cast it as string to get the correct 
value out of it.
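
A rough sketch of that cast workaround (column names taken from the 
printSchema output above; an illustration only, reusing the v_event_cnt 
DataFrame from the earlier snippet):

// Cast the mis-typed binary column back to string on the Spark side
val v_event_cnt_str = v_event_cnt.selectExpr(
  "column1",
  "CAST(column2 AS STRING) AS column2",
  "dt")
v_event_cnt_str.printSchema()   // column2 now shows up as string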


I wonder which part causes this type mismatch in the Parquet file: is 
Hive not generating the correct Parquet schema, or is it Spark that 
cannot recognize it due to a problem on its side?


Is there a way I can configure either Hive or Spark so that this 
Parquet schema is handled correctly on both ends?


Thanks

Yong




Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
BTW, just checked that this bug should have been fixed since Hive 
0.14.0. So the SQL option I mentioned is mostly used for reading legacy 
Parquet files generated by older versions of Hive.
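
If it helps, the same option can also be set with a SQL statement in the 
shell (a minimal sketch, equivalent to sqlContext.setConf; only needed 
for Parquet files written by Hive versions before 0.14):

// Only needed when reading Parquet written by Hive < 0.14
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")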


Cheng
