Another detail: ~400 mappers 64 reducers
2013/9/20 Stephen Boesch <java...@gmail.com> > > We have a small (3GB /280M rows) table with 435 partitions that is highly > skewed: one partition has nearly 200M, two others have nearly 40M apiece, > then the remaining 432 have all together less than 1% of total table size. > > So .. the skew is something to be addressed. However - even give that - > why would the following occur? > > > Table Structure: > > # Partition Information > # col_name data_type comment > derived_create_dt string None > > # Detailed Table Information > .. > Protect Mode: None > Retention: 0 > .. > Table Type: MANAGED_TABLE > Table Parameters: > SORTBUCKETCOLSPREFIX TRUE > transient_lastDdlTime 1379678551 > > # Storage Information > SerDe Library: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe > InputFormat: org.apache.hadoop.hive.ql.io.RCFileInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.RCFileOutputFormat > Compressed: No > Num Buckets: 64 > Bucket Columns: [station_id] > Sort Columns: [Order(col:station_id, order:1)] > Storage Desc Params: > serialization.format 1 > > HIGHLY SKEWED data: although > This particular load: > 300M rows > 4GB > 435 partitions > Over 99% of data in just 3 out of the 435 partitons > 2013-09-18 26733990 > 2013-09-19 191634067 > 2013-09-20 63790065 > > > > Map takes 10 min > Reduce 13 mins > Loading into partitions takes 3 hours 27 minutes > > >