Hi,

I have a question about the Thrift Parquet writer. I have an MR job that writes
output data in Parquet format using "ThriftParquetWriter", and the output is
consumed by a corresponding Hive table. I write col2 as long (or binary), and
I read it in the Hive table as timestamp. In both cases I get a
type-conversion error.

*thrift schema*:
struct sample1 {
    1: optional i32 col1,
    2: optional i64 col2,
}

struct sample2 {
    1: optional i32 col1,
    2: optional binary col2,
}
The corresponding Hive tables:
*sample1*
col_name            data_type
col1                    int
col2                    timestamp

*sample2*
col_name            data_type
col1                    int
col2                    timestamp


*hive query*
*select col1, col2 from sample1;*

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:773)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:149)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:489)
        ... 9 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.copyObject(WritableTimestampObjectInspector.java:43)
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:380)
        at org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:152)
        at org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:144)
        at org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.copyKey(KeyWrapperFactory.java:121)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:786)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:700)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:768)

*select col1, col2 from sample2;*
Failed with exception
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.hive.serde2.io.TimestampWritable

*Storage Information*
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:             No
Num Buckets:            0
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
    serialization.format    1

One workaround would be to declare col2 as bigint in the Hive table and
convert at query time, e.g. "select col1, cast(col2 as timestamp) from
sample1", but we want to avoid that.

While writing the data I also see this message:
 "INFO: org.apache.parquet.hadoop.thrift.AbstractThriftWriteSupport: Pig is
not loaded, pig metadata will not be written"

Code snippet:
public static void writeS1(String fileName) throws Exception {

    sample1 obj1 = new sample1();
    sample1 obj2 = new sample1();
    sample1 obj3 = new sample1();

    obj1.col1 = 1;
    obj2.col1 = 2;
    obj3.col1 = 3;

    obj1.col2 = 1483228801;
    obj2.col2 = 1483228801;
    obj3.col2 = 1483228801;

    Path f = new Path(fileName);
    ThriftParquetWriter<sample1> thriftParquetWriter =
        new ThriftParquetWriter<sample1>(f, sample1.class, CompressionCodecName.UNCOMPRESSED);
    thriftParquetWriter.write(obj1);
    thriftParquetWriter.write(obj2);
    thriftParquetWriter.write(obj3);
    thriftParquetWriter.close();

}

public static void writeS2(String fileName) throws Exception  {

    sample2 obj1 = new sample2();
    sample2 obj2 = new sample2();
    sample2 obj3 = new sample2();

    obj1.col1 = 1;
    obj2.col1 = 2;
    obj3.col1 = 3;

    obj1.col2 = ByteBuffer.allocate(Long.BYTES).putLong(1483228801);
    obj2.col2 = ByteBuffer.allocate(Long.BYTES).putLong(1483228801);
    obj3.col2 = ByteBuffer.allocate(Long.BYTES).putLong(1483228801);

    Path f = new Path(fileName);
    ThriftParquetWriter<sample2> thriftParquetWriter =
        new ThriftParquetWriter<sample2>(f, sample2.class, CompressionCodecName.UNCOMPRESSED);
    thriftParquetWriter.write(obj1);
    thriftParquetWriter.write(obj2);
    thriftParquetWriter.write(obj3);
    thriftParquetWriter.close();

}
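A side observation on writeS2, in case it matters: I am not certain how the
Thrift serializer consumes a binary field, but if it reads the buffer's
remaining bytes, then ByteBuffer.allocate(...).putLong(...) hands it an empty
buffer, because the relative put leaves the position at the limit. A minimal
sketch of the difference:

```java
import java.nio.ByteBuffer;

public class ByteBufferCheck {
    public static void main(String[] args) {
        // allocate(...).putLong(...) leaves position == limit == 8,
        // so a consumer of the "remaining" bytes sees an empty buffer.
        ByteBuffer unflipped = ByteBuffer.allocate(Long.BYTES).putLong(1483228801L);
        System.out.println(unflipped.remaining()); // prints 0

        // flip() resets position to 0 (limit stays at 8), exposing all
        // eight written bytes to a serializer. The cast is needed on
        // Java 8, where flip() is declared to return Buffer.
        ByteBuffer flipped = (ByteBuffer) ByteBuffer.allocate(Long.BYTES)
                .putLong(1483228801L).flip();
        System.out.println(flipped.remaining()); // prints 8
    }
}
```

Wrapping an already-filled byte array with ByteBuffer.wrap(bytes) would avoid
the issue as well.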


The schema in the Parquet files is:
*sample1.parquet*
{
  "id" : "STRUCT",
  "children" : [ {
    "name" : "col1",
    "fieldId" : 1,
    "requirement" : "OPTIONAL",
    "type" : {
      "id" : "I32"
    }
  }, {
    "name" : "col2",
    "fieldId" : 2,
    "requirement" : "OPTIONAL",
    "type" : {
      "id" : "I64"
    }
  } ],
  "structOrUnionType" : "STRUCT"
}

*sample2.parquet*
{
  "id" : "STRUCT",
  "children" : [ {
    "name" : "col1",
    "fieldId" : 1,
    "requirement" : "OPTIONAL",
    "type" : {
      "id" : "I32"
    }
  }, {
    "name" : "col2",
    "fieldId" : 2,
    "requirement" : "OPTIONAL",
    "type" : {
      "id" : "STRING"
    }
  } ],
  "structOrUnionType" : "STRUCT"
}

I have attached the Parquet files.

Please let me know if I am missing something or doing this wrongly, or
whether this is a known issue or has a workaround.

-Ankit
