[
https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Phabricator updated HIVE-2950:
------------------------------
Attachment: HIVE-2950.D2769.1.patch
travis requested code review of "HIVE-2950 [jira] Hive should store the full
table schema in partition storage descriptors".
Reviewers: JIRA
When adding a partition, copy the full table schema into the partition
storage descriptor. The table storage descriptor may not contain the full
schema when using a serde that reports its schema. This change makes partitions
with serde-reported schema behave the same as partitions without a
serde-reported schem.
Hive tables have a schema, which is copied into the partition storage
descriptor when adding a partition. Currently only columns stored in the table
storage descriptor are copied - columns that are reported by the serde are not
copied. Instead of copying the table storage descriptor columns into the
partition columns, the full table schema should be copied.
DETAILS
This is a little long but is necessary to show 3 things: current behavior
when explicitly listing columns, behavior with HIVE-2941 patched in and serde
reported columns, and finally the behavior with this patch (full table schema
copied into the partition storage descriptor).
Here's an example of what currently happens. Note the following:
the two manually-defined fields defined for the table are listed in the
table storage descriptor.
both fields are present in the partition storage descriptor
This works great because users who query for a partition can look at its
storage descriptor and get the schema.
hive> create external table foo_test (name string, age int) partitioned by
(part_dt string);
hive> describe extended foo_test;
OK
name string
age int
part_dt string
Detailed Table Information Table(tableName:foo_test, dbName:travis_test,
owner:travis, createTime:1334256062, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null),
FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt,
type:string, comment:null)],
location:hdfs://foo.com/warehouse/travis_test.db/foo_test,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
parameters:{}, primaryRegionName:, secondaryRegions:[]),
partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
Time taken: 0.082 seconds
hive> alter table foo_test add partition (part_dt = '20120331T000000Z')
location 'hdfs://foo.com/foo/2012/03/31/00';
hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
OK
name string
age int
part_dt string
Detailed Partition Information Partition(values:[20120331T000000Z],
dbName:travis_test, tableName:foo_test, createTime:1334256131,
lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
type:string, comment:null), FieldSchema(name:age, type:int, comment:null),
FieldSchema(name:part_dt, type:string, comment:null)],
location:hdfs://foo.com/foo/2012/03/31/00,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
parameters:{}, primaryRegionName:, secondaryRegions:[]),
parameters:{transient_lastDdlTime=1334256131})
CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
Now let's examine what happens when creating a table when the serde reports
the schema. Notice the following:
The table storage descriptor contains an empty list of columns.
However, the table schema is available from the serde reflecting on the
serialization class.
The partition storage descriptor does contain a single "part_dt" column
that was copied from the table partition keys. The actual data columns are not
present.
hive> create external table travis_test.person_test partitioned by (part_dt
string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with
serdeproperties
("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
stored as inputformat
"com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Time taken: 0.08 seconds
hive> describe extended person_test;
OK
name struct<first_name:string,last_name:string> from deserializer
id int from deserializer
email string from deserializer
phones array<struct<number:string,type:struct<value:int>>> from
deserializer
part_dt string
Detailed Table Information Table(tableName:person_test,
dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[],
location:hdfs://foo.com/warehouse/travis_test.db/person_test,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
primaryRegionName:, secondaryRegions:[]),
partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
Time taken: 0.147 seconds
hive> alter table person_test add partition (part_dt = '20120331T000000Z')
location 'hdfs://foo.com/foo/2012/03/31/00';
OK
Time taken: 0.149 seconds
hive> describe extended person_test partition (part_dt = '20120331T000000Z');
OK
part_dt string
Detailed Partition Information Partition(values:[20120331T000000Z],
dbName:travis_test, tableName:person_test, createTime:1334257029,
lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt,
type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
primaryRegionName:, secondaryRegions:[]),
parameters:{transient_lastDdlTime=1334257029})
Time taken: 0.106 seconds
hive>
PROPOSED BEHAVIOR
I believe the correct thing to do is copy the full table schema
(serde-reported columns + partition keys) into the partition storage
descriptor. Notice the following:
Table storage descriptor does not contain any columns, because they are
reported by the serde.
Partition storage descriptor now contains both the serde-reported
schema, and full table schema.
hive> create external table travis_test.person_test partitioned by (part_dt
string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with
serdeproperties
("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
stored as inputformat
"com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
OK
Time taken: 0.076 seconds
hive> describe extended person_test;
OK
name struct<first_name:string,last_name:string>
from deserializer
id int from deserializer
email string from deserializer
phones array<struct<number:string,type:struct<value:int>>> from
deserializer
part_dt string
Detailed Table Information Table(tableName:person_test,
dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[],
location:hdfs://foo.com/warehouse/travis_test.db/person_test,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
primaryRegionName:, secondaryRegions:[]),
partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
Time taken: 0.155 seconds
hive> alter table person_test add partition (part_dt = '20120331T000000Z')
location 'hdfs://foo.com/foo/2012/03/31/00';
OK
Time taken: 0.296 seconds
hive> describe extended person_test partition (part_dt = '20120331T000000Z');
OK
name struct<first_name:string,last_name:string>
from deserializer
id int from deserializer
email string from deserializer
phones array<struct<number:string,type:struct<value:int>>> from
deserializer
part_dt string
Detailed Partition Information Partition(values:[20120331T000000Z],
dbName:travis_test, tableName:person_test, createTime:1334257504,
lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
type:struct<first_name:string,last_name:string>, comment:from deserializer),
FieldSchema(name:id, type:int, comment:from deserializer),
FieldSchema(name:email, type:string, comment:from deserializer),
FieldSchema(name:phones,
type:array<struct<number:string,type:struct<value:int>>>, comment:from
deserializer), FieldSchema(name:part_dt, type:string, comment:null)],
location:hdfs://foo.com/foo/2012/03/31/00,
inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
serialization.format
=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:,
secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257504})
Time taken: 0.133 seconds
hive>
TEST PLAN
empty
REVISION DETAIL
https://reviews.facebook.net/D2769
AFFECTED FILES
ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
MANAGE HERALD DIFFERENTIAL RULES
https://reviews.facebook.net/herald/view/differential/
WHY DID I GET THIS EMAIL?
https://reviews.facebook.net/herald/transcript/6297/
Tip: use the X-Herald-Rules header to filter Herald messages in your client.
> Hive should store the full table schema in partition storage descriptors
> ------------------------------------------------------------------------
>
> Key: HIVE-2950
> URL: https://issues.apache.org/jira/browse/HIVE-2950
> Project: Hive
> Issue Type: Bug
> Reporter: Travis Crawford
> Assignee: Travis Crawford
> Attachments: HIVE-2950.D2769.1.patch
>
>
> Hive tables have a schema, which is copied into the partition storage
> descriptor when adding a partition. Currently only columns stored in the
> table storage descriptor are copied - columns that are reported by the serde
> are not copied. Instead of copying the table storage descriptor columns into
> the partition columns, the full table schema should be copied.
> DETAILS
> This is a little long but is necessary to show 3 things: current behavior
> when explicitly listing columns, behavior with HIVE-2941 patched in and serde
> reported columns, and finally the behavior with this patch (full table schema
> copied into the partition storage descriptor).
> Here's an example of what currently happens. Note the following:
> * the two manually-defined fields defined for the table are listed in the
> table storage descriptor.
> * both fields are present in the partition storage descriptor
> This works great because users who query for a partition can look at its
> storage descriptor and get the schema.
> {code}
> hive> create external table foo_test (name string, age int) partitioned by
> (part_dt string);
> hive> describe extended foo_test;
> OK
> name string
> age int
> part_dt string
>
> Detailed Table Information Table(tableName:foo_test, dbName:travis_test,
> owner:travis, createTime:1334256062, lastAccessTime:0, retention:0,
> sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null),
> FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt,
> type:string, comment:null)],
> location:hdfs://foo.com/warehouse/travis_test.db/foo_test,
> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
> parameters:{}, primaryRegionName:, secondaryRegions:[]),
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062},
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.082 seconds
> hive> alter table foo_test add partition (part_dt = '20120331T000000Z')
> location 'hdfs://foo.com/foo/2012/03/31/00';
> hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
> OK
> name string
> age int
> part_dt string
>
> Detailed Partition Information Partition(values:[20120331T000000Z],
> dbName:travis_test, tableName:foo_test, createTime:1334256131,
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
> type:string, comment:null), FieldSchema(name:age, type:int, comment:null),
> FieldSchema(name:part_dt, type:string, comment:null)],
> location:hdfs://foo.com/foo/2012/03/31/00,
> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
> parameters:{}, primaryRegionName:, secondaryRegions:[]),
> parameters:{transient_lastDdlTime=1334256131})
> {code}
> CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
> Now let's examine what happens when creating a table when the serde reports
> the schema. Notice the following:
> * The table storage descriptor contains an empty list of columns. However,
> the table schema is available from the serde reflecting on the serialization
> class.
> * The partition storage descriptor does contain a single "part_dt" column
> that was copied from the table partition keys. The actual data columns are
> not present.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt
> string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe"
> with serdeproperties
> ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
> stored as inputformat
> "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
> "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.08 seconds
> hive> describe extended person_test;
> OK
> name struct<first_name:string,last_name:string> from deserializer
> id int from deserializer
> email string from deserializer
> phones array<struct<number:string,type:struct<value:int>>> from
> deserializer
> part_dt string
>
> Detailed Table Information Table(tableName:person_test,
> dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0,
> retention:0, sd:StorageDescriptor(cols:[],
> location:hdfs://foo.com/warehouse/travis_test.db/person_test,
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
> serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
> primaryRegionName:, secondaryRegions:[]),
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942},
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.147 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z')
> location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
> Time taken: 0.149 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> part_dt string
>
> Detailed Partition Information Partition(values:[20120331T000000Z],
> dbName:travis_test, tableName:person_test, createTime:1334257029,
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt,
> type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00,
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
> serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
> primaryRegionName:, secondaryRegions:[]),
> parameters:{transient_lastDdlTime=1334257029})
> Time taken: 0.106 seconds
> hive>
> {code}
> PROPOSED BEHAVIOR
> I believe the correct thing to do is copy the full table schema
> (serde-reported columns + partition keys) into the partition storage
> descriptor. Notice the following:
> * Table storage descriptor does not contain any columns, because they are
> reported by the serde.
> * Partition storage descriptor now contains both the serde-reported schema,
> and full table schema.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt
> string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe"
> with serdeproperties
> ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
> stored as inputformat
> "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat
> "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.076 seconds
> hive> describe extended person_test;
>
> OK
>
> name struct<first_name:string,last_name:string>
> from deserializer
> id int from deserializer
> email string from deserializer
> phones array<struct<number:string,type:struct<value:int>>> from
> deserializer
> part_dt string
>
> Detailed Table Information Table(tableName:person_test,
> dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0,
> retention:0, sd:StorageDescriptor(cols:[],
> location:hdfs://foo.com/warehouse/travis_test.db/person_test,
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
> serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
> primaryRegionName:, secondaryRegions:[]),
> partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)],
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489},
> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
> Time taken: 0.155 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z')
> location 'hdfs://foo.com/foo/2012/03/31/00';
> OK
>
> Time taken: 0.296 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
>
> OK
>
> name struct<first_name:string,last_name:string>
> from deserializer
> id int from deserializer
> email string from deserializer
> phones array<struct<number:string,type:struct<value:int>>> from
> deserializer
> part_dt string
>
> Detailed Partition Information Partition(values:[20120331T000000Z],
> dbName:travis_test, tableName:person_test, createTime:1334257504,
> lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name,
> type:struct<first_name:string,last_name:string>, comment:from deserializer),
> FieldSchema(name:id, type:int, comment:from deserializer),
> FieldSchema(name:email, type:string, comment:from deserializer),
> FieldSchema(name:phones,
> type:array<struct<number:string,type:struct<value:int>>>, comment:from
> deserializer), FieldSchema(name:part_dt, type:string, comment:null)],
> location:hdfs://foo.com/foo/2012/03/31/00,
> inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe,
> parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person,
> serialization.format=1}), bucketCols:[], sortCols:[], parameters:{},
> primaryRegionName:, secondaryRegions:[]),
> parameters:{transient_lastDdlTime=1334257504})
> Time taken: 0.133 seconds
> hive>
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira