[
https://issues.apache.org/jira/browse/HIVE-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966341#comment-13966341
]
Justin Coffey commented on HIVE-6784:
-------------------------------------
You've cited a "lazy" SerDe. Parquet is not "lazy"; it is similar to ORC.
Have a look at ORC's deserialize() method
(org.apache.hadoop.hive.ql.io.orc.OrcSerde):
{code}
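// Nothing is parsed or copied per record: the Writable from the reader is returned as-is.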
@Override
public Object deserialize(Writable writable) throws SerDeException {
return writable;
}
{code}
A quick look through ORC code indicates to me that they don't do any reparsing
(though I might have missed something).
Looking through other SerDes, not a single one (that I checked) reparses
values. Value parsing is handled in ObjectInspectors (poke around
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils).
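For illustration (the class name below is my own; the Hive classes and methods are
real ones from serde2): a minimal sketch of the ObjectInspector layer doing the
primitive conversion, assuming the reader hands back an IntWritable for a column
a consumer reads as a long:
{code}
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.io.IntWritable;

public class ObjectInspectorWideningSketch {
  public static void main(String[] args) {
    // The value exactly as the reader produced it: still an IntWritable.
    IntWritable raw = new IntWritable(42);

    // Inspector matching what is actually stored in the file.
    PrimitiveObjectInspector intOI =
        PrimitiveObjectInspectorFactory.writableIntObjectInspector;

    // A consumer that wants a bigint asks the OI layer for a long; the widening
    // happens here rather than in the SerDe, and no new Writable is allocated.
    long asLong = PrimitiveObjectInspectorUtils.getLong(raw, intOI);
    System.out.println(asLong); // prints 42
  }
}
{code}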
In my opinion, the *substantial* performance penalty that you are introducing
with this patch is going to be a much bigger negative for Parquet adoption than
obliging people to rebuild their data set in the rare event that they have to
change a type.
And if you do need to change a type, INSERT OVERWRITE TABLE is a good
workaround.
-1
> parquet-hive should allow column type change
> --------------------------------------------
>
> Key: HIVE-6784
> URL: https://issues.apache.org/jira/browse/HIVE-6784
> Project: Hive
> Issue Type: Bug
> Components: File Formats, Serializers/Deserializers
> Affects Versions: 0.13.0
> Reporter: Tongjie Chen
> Fix For: 0.14.0
>
> Attachments: HIVE-6784.1.patch.txt, HIVE-6784.2.patch.txt
>
>
> See also the following Parquet issue:
> https://github.com/Parquet/parquet-mr/issues/323
> Currently, if we change a Parquet-format Hive table's column type using "alter
> table parquet_table change c1 c1 bigint" (assuming the original type of c1 is
> int), queries fail at runtime with an exception thrown from the SerDe:
> "org.apache.hadoop.io.IntWritable cannot be cast to
> org.apache.hadoop.io.LongWritable".
> This is different from Hive's behavior with other file formats, where it will
> try to perform a cast (producing a null value in case of an incompatible type).
> Parquet Hive's RecordReader returns an ArrayWritable (based on the schema stored
> in the footers of the parquet files); ParquetHiveSerDe also creates a corresponding
> ArrayWritableObjectInspector (but using column type info from the metastore).
> Whenever there is a column type change, the object inspector will throw an
> exception, since WritableLongObjectInspector cannot inspect an IntWritable,
> etc.
> Conversion has to happen somewhere if we want to allow type changes. The SerDe's
> deserialize method seems a natural place for it.
> Currently, the serialize method calls createStruct (and then createPrimitive) for
> every record, and it creates a new object regardless, which seems expensive.
> I think that could be optimized a bit by just returning the passed-in object if it
> is already of the right type. deserialize also reuses this method; if there is a
> type change, a new object has to be created, which I think is
> inevitable.
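> For illustration (a hypothetical helper, not taken from the attached patches): a
> minimal sketch of the "reuse the object when it is already the right type" idea
> for an int to bigint column change:
> {code}
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Writable;
>
> public class LongColumnConversionSketch {
>   // Only allocate a new object when the stored type differs from the declared type.
>   static Writable toLongWritable(Writable value) {
>     if (value instanceof LongWritable) {
>       return value;                                          // already correct: reuse as-is
>     }
>     if (value instanceof IntWritable) {
>       return new LongWritable(((IntWritable) value).get());  // widen int to bigint
>     }
>     return null;                                             // incompatible: mirror Hive's null-on-bad-cast behavior
>   }
> }
> {code}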
--
This message was sent by Atlassian JIRA
(v6.2#6252)