[
https://issues.apache.org/jira/browse/HIVE-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966341#comment-13966341
]
Justin Coffey commented on HIVE-6784:
-------------------------------------
You've cited a "lazy" SerDe. Parquet is not "lazy"; it is similar to ORC.
Have a look at ORC's deserialize() method
(org.apache.hadoop.hive.ql.io.orc.OrcSerde):
{code}
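// Nothing is parsed or copied per record: the Writable from the reader is returned as-is.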
@Override
public Object deserialize(Writable writable) throws SerDeException {
return writable;
}
{code}
A quick look through ORC code indicates to me that they don't do any reparsing
(though I might have missed something).
Looking through other SerDes, not a single one (that I checked) reparses
values. Value parsing is handled in ObjectInspectors (poke around
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils).
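For illustration (the class name below is my own; the Hive classes and methods are
real ones from serde2): a minimal sketch of the ObjectInspector layer doing the
primitive conversion, assuming the reader hands back an IntWritable for a column
a consumer reads as a long:
{code}
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.io.IntWritable;

public class ObjectInspectorWideningSketch {
  public static void main(String[] args) {
    // The value exactly as the reader produced it: still an IntWritable.
    IntWritable raw = new IntWritable(42);

    // Inspector matching what is actually stored in the file.
    PrimitiveObjectInspector intOI =
        PrimitiveObjectInspectorFactory.writableIntObjectInspector;

    // A consumer that wants a bigint asks the OI layer for a long; the widening
    // happens here rather than in the SerDe, and no new Writable is allocated.
    long asLong = PrimitiveObjectInspectorUtils.getLong(raw, intOI);
    System.out.println(asLong); // prints 42
  }
}
{code}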
In my opinion, the *substantial* performance penalty that you are introducing
with this patch is going to be a much bigger negative for Parquet adoption than
obliging people to rebuild their data set in the rare event that they have to
change a type.
And if you do need to change a type, INSERT OVERWRITE TABLE is a good
workaround.
-1
> parquet-hive should allow column type change
> --------------------------------------------
>
> Key: HIVE-6784
> URL: https://issues.apache.org/jira/browse/HIVE-6784
> Project: Hive
> Issue Type: Bug
> Components: File Formats, Serializers/Deserializers
> Affects Versions: 0.13.0
> Reporter: Tongjie Chen
> Fix For: 0.14.0
>
> Attachments: HIVE-6784.1.patch.txt, HIVE-6784.2.patch.txt
>
>
> See also the following Parquet issue:
> https://github.com/Parquet/parquet-mr/issues/323
> Currently, if we change a Parquet-format Hive table's column type using "alter
> table parquet_table change c1 c1 bigint" (assuming the original type of c1 is
> int), queries fail at runtime with an exception thrown from the SerDe:
> "org.apache.hadoop.io.IntWritable cannot be cast to
> org.apache.hadoop.io.LongWritable".
> This is different from Hive's behavior with other file formats, where it will
> try to perform a cast (producing a null value in case of an incompatible type).
> Parquet Hive's RecordReader returns an ArrayWritable (based on the schema stored
> in the footers of the parquet files); ParquetHiveSerDe also creates a corresponding
> ArrayWritableObjectInspector (but using column type info from the metastore).
> Whenever there is a column type change, the object inspector will throw an
> exception, since WritableLongObjectInspector cannot inspect an IntWritable,
> etc.
> Conversion has to happen somewhere if we want to allow type changes. The SerDe's
> deserialize method seems a natural place for it.
> Currently, the serialize method calls createStruct (and then createPrimitive) for
> every record, and it creates a new object regardless, which seems expensive.
> I think that could be optimized a bit by just returning the passed-in object if it
> is already of the right type. deserialize also reuses this method; if there is a
> type change, a new object has to be created, which I think is
> inevitable.
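> For illustration (a hypothetical helper, not taken from the attached patches): a
> minimal sketch of the "reuse the object when it is already the right type" idea
> for an int to bigint column change:
> {code}
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Writable;
>
> public class LongColumnConversionSketch {
>   // Only allocate a new object when the stored type differs from the declared type.
>   static Writable toLongWritable(Writable value) {
>     if (value instanceof LongWritable) {
>       return value;                                          // already correct: reuse as-is
>     }
>     if (value instanceof IntWritable) {
>       return new LongWritable(((IntWritable) value).get());  // widen int to bigint
>     }
>     return null;                                             // incompatible: mirror Hive's null-on-bad-cast behavior
>   }
> }
> {code}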
--
This message was sent by Atlassian JIRA
(v6.2#6252)