We have a small (3GB /280M rows) table with 435 partitions that is highly skewed: one partition has nearly 200M, two others have nearly 40M apiece, then the remaining 432 have all together less than 1% of total table size.
So .. the skew is something to be addressed. However - even give that - why would the following occur? Table Structure: # Partition Information # col_name data_type comment derived_create_dt string None # Detailed Table Information .. Protect Mode: None Retention: 0 .. Table Type: MANAGED_TABLE Table Parameters: SORTBUCKETCOLSPREFIX TRUE transient_lastDdlTime 1379678551 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat Compressed: No Num Buckets: 64 Bucket Columns: [station_id] Sort Columns: [Order(col:station_id, order:1)] Storage Desc Params: serialization.format 1 HIGHLY SKEWED data: although This particular load: 300M rows 4GB 435 partitions Over 99% of data in just 3 out of the 435 partitons 2013-09-18 26733990 2013-09-19 191634067 2013-09-20 63790065 Map takes 10 min Reduce 13 mins Loading into partitions takes 3 hours 27 minutes