[
https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291338#comment-13291338
]
Travis Crawford commented on HIVE-2950:
---------------------------------------
Updated the branch against trunk and reran the tests.
https://travis.ci.cloudbees.com/job/HIVE-2950_partition_full_schema/1/
There were some test failures in CI, so I ran these locally and they passed:
{code}
ant clean package test -Dtestcase=TestCliDriver -Dqfile=create_view_partitioned.q
ant clean package test -Dtestcase=TestNegativeCliDriver -Dqfile=create_or_replace_view1.q
ant clean package test -Dtestcase=TestNegativeCliDriver -Dqfile=create_or_replace_view2.q
ant clean package test -Dtestcase=TestHiveServerSessions
{code}
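For context, the change under test can be sketched as follows. This is an illustrative Python sketch, not the actual metastore code: the FieldSchema and StorageDescriptor classes below merely mimic the metastore Thrift structs, and full_partition_cols is a hypothetical helper showing the proposed behavior (serde-reported columns plus partition keys copied into the partition storage descriptor).

```python
# Illustrative sketch only -- not the actual Hive metastore code.
# FieldSchema and StorageDescriptor mimic the metastore Thrift structs;
# full_partition_cols is a hypothetical helper showing the proposed copy.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FieldSchema:
    name: str
    type: str
    comment: Optional[str] = None

@dataclass
class StorageDescriptor:
    cols: List[FieldSchema] = field(default_factory=list)

def full_partition_cols(serde_cols: List[FieldSchema],
                        partition_keys: List[FieldSchema]) -> List[FieldSchema]:
    """Proposed behavior: the partition storage descriptor receives the
    full table schema (serde-reported columns followed by the partition
    keys) rather than the possibly-empty table storage descriptor cols."""
    return list(serde_cols) + list(partition_keys)

# With a serde-backed table the table SD cols are empty, but the serde
# still reports the data columns (as in the person_test example below):
serde_cols = [
    FieldSchema("name", "struct<first_name:string,last_name:string>",
                "from deserializer"),
    FieldSchema("id", "int", "from deserializer"),
    FieldSchema("email", "string", "from deserializer"),
]
partition_keys = [FieldSchema("part_dt", "string")]

# The new partition's storage descriptor now carries the full schema,
# including the part_dt partition key.
partition_sd = StorageDescriptor(
    cols=full_partition_cols(serde_cols, partition_keys))
```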
> Hive should store the full table schema in partition storage descriptors
> ------------------------------------------------------------------------
>
> Key: HIVE-2950
> URL: https://issues.apache.org/jira/browse/HIVE-2950
> Project: Hive
> Issue Type: Bug
> Reporter: Travis Crawford
> Assignee: Travis Crawford
> Attachments: HIVE-2950.D2769.1.patch
>
>
> Hive tables have a schema, which is copied into the partition storage
> descriptor when adding a partition. Currently, only the columns stored in the
> table storage descriptor are copied; columns reported by the serde are not.
> Instead of copying the table storage descriptor columns into the partition
> columns, the full table schema should be copied.
> DETAILS
> This is a little long, but it is necessary to show three things: the current
> behavior when explicitly listing columns, the behavior with HIVE-2941 patched
> in and serde-reported columns, and finally the behavior with this patch (the
> full table schema copied into the partition storage descriptor).
> Here's an example of what currently happens. Note the following:
> * the two manually defined fields are listed in the table storage
> descriptor.
> * both fields are also present in the partition storage descriptor
> This works great because users who query for a partition can look at its
> storage descriptor and get the schema.
> {code}
> hive> create external table foo_test (name string, age int) partitioned by
> (part_dt string);
> hive> describe extended foo_test;
> OK
> name string
> age int
> part_dt string
>
> Detailed Table Information Table(tableName:foo_test, dbName:travis_test,
> owner:travis, createTime:1334256062, lastAccessTime:0, retention:0,
> sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null),
> FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt,
> type:string, comment:null)],
> location:hdfs://foo.com/warehouse/travis_test.db/foo_test,
> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
> parameters:{}, primaryRegionName:, secondaryRegions:[]),
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062},
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.082 seconds
> hive> alter table foo_test add partition (part_dt = '20120331T000000Z')
> location 'hdfs://foo.com/foo/2012/03/31/00';
> hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
> OK
> name string
> age int
> part_dt string
>
> Detailed Partition Information Partition(values:[20120331T000000Z],
> dbName:travis_test, tableName:foo_test, createTime:1334256131,
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
> type:string, comment:null), FieldSchema(name:age, type:int, comment:null),
> FieldSchema(name:part_dt, type:string, comment:null)],
> location:hdfs://foo.com/foo/2012/03/31/00,
> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
> parameters:{}, primaryRegionName:, secondaryRegions:[]),
> parameters:{transient_lastDdlTime=1334256131})
> {code}
> CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
> Now let's examine what happens when creating a table whose schema is reported
> by the serde. Notice the following:
> * The table storage descriptor contains an empty list of columns. However,
> the table schema is still available from the serde, which reflects on the
> serialization class.
> * The partition storage descriptor contains only a single "part_dt" column,
> copied from the table partition keys. The actual data columns are not
> present.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt
> string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe"
> with serdeproperties
> ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
> stored as inputformat
> "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
> "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.08 seconds
> hive> describe extended person_test;
> OK
> name struct<first_name:string,last_name:string> from deserializer
> id int from deserializer
> email string from deserializer
> phones array<struct<number:string,type:struct<value:int>>> from deserializer
> part_dt string
>
> Detailed Table Information Table(tableName:person_test,
> dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0,
> retention:0, sd:StorageDescriptor(cols:[],
> location:hdfs://foo.com/warehouse/travis_test.db/person_test,
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
> serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
> primaryRegionName:, secondaryRegions:[]),
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942},
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.147 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z')
> location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.149 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> part_dt string
>
> Detailed Partition Information Partition(values:[20120331T000000Z],
> dbName:travis_test, tableName:person_test, createTime:1334257029,
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt,
> type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00,
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
> serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
> primaryRegionName:, secondaryRegions:[]),
> parameters:{transient_lastDdlTime=1334257029})
> Time taken: 0.106 seconds
> hive>
> {code}
> PROPOSED BEHAVIOR
> I believe the correct thing to do is to copy the full table schema
> (serde-reported columns + partition keys) into the partition storage
> descriptor. Notice the following:
> * The table storage descriptor still does not contain any columns, because
> they are reported by the serde.
> * The partition storage descriptor now contains the serde-reported columns
> plus the partition keys, that is, the full table schema.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt
> string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe"
> with serdeproperties
> ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
> stored as inputformat
> "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
> "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.076 seconds
> hive> describe extended person_test;
> OK
> name struct<first_name:string,last_name:string> from deserializer
> id int from deserializer
> email string from deserializer
> phones array<struct<number:string,type:struct<value:int>>> from deserializer
> part_dt string
>
> Detailed Table Information Table(tableName:person_test,
> dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0,
> retention:0, sd:StorageDescriptor(cols:[],
> location:hdfs://foo.com/warehouse/travis_test.db/person_test,
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
> serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
> primaryRegionName:, secondaryRegions:[]),
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489},
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.155 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z')
> location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.296 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> name struct<first_name:string,last_name:string> from deserializer
> id int from deserializer
> email string from deserializer
> phones array<struct<number:string,type:struct<value:int>>> from deserializer
> part_dt string
>
> Detailed Partition Information Partition(values:[20120331T000000Z],
> dbName:travis_test, tableName:person_test, createTime:1334257504,
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
> type:struct<first_name:string,last_name:string>, comment:from deserializer),
> FieldSchema(name:id, type:int, comment:from deserializer),
> FieldSchema(name:email, type:string, comment:from deserializer),
> FieldSchema(name:phones,
> type:array<struct<number:string,type:struct<value:int>>>, comment:from
> deserializer), FieldSchema(name:part_dt, type:string, comment:null)],
> location:hdfs://foo.com/foo/2012/03/31/00,
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
> serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
> primaryRegionName:, secondaryRegions:[]),
> parameters:{transient_lastDdlTime=1334257504})
> Time taken: 0.133 seconds
> hive>
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira