Re: Loading data into partition taking seven times total of (map+reduce) on highly skewed data

Stephen Boesch Fri, 20 Sep 2013 14:56:16 -0700

Another detail: ~400 mappers, 64 reducers.


2013/9/20 Stephen Boesch <java...@gmail.com>

>
> We have a small (3GB / 280M rows) table with 435 partitions that is highly
> skewed: one partition has nearly 200M rows, two others have nearly 40M apiece,
> and the remaining 432 together hold less than 1% of the total table size.
>
> So the skew is something to be addressed. However, even given that,
> why would the following occur?
>
>
> Table Structure:
>
> # Partition Information
> # col_name            data_type           comment
> derived_create_dt     string              None
>
> # Detailed Table Information
> ..
> Protect Mode:         None
> Retention:            0
> ..
> Table Type:           MANAGED_TABLE
> Table Parameters:
>     SORTBUCKETCOLSPREFIX     TRUE
>     transient_lastDdlTime    1379678551
>
> # Storage Information
> SerDe Library:        org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
> InputFormat:          org.apache.hadoop.hive.ql.io.RCFileInputFormat
> OutputFormat:         org.apache.hadoop.hive.ql.io.RCFileOutputFormat
> Compressed:           No
> Num Buckets:          64
> Bucket Columns:       [station_id]
> Sort Columns:         [Order(col:station_id, order:1)]
> Storage Desc Params:
>     serialization.format     1
>
> HIGHLY SKEWED data. This particular load:
>     300M rows
>     4GB
>     435 partitions
>     Over 99% of data in just 3 out of the 435 partitions:
>         2013-09-18   26733990
>         2013-09-19  191634067
>         2013-09-20   63790065
>
>
>
> Map takes 10 minutes
> Reduce takes 13 minutes
> Loading into partitions takes 3 hours 27 minutes
>
>
>
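For reference, the arithmetic behind the figures quoted above can be checked with a short Python sketch. It only uses the numbers from the message. The function names are made up for illustration, and the file-count bound rests on an assumption that is not confirmed anywhere in this thread: that the load writes at most one file per bucket in each partition.

```python
# Figures copied from the quoted message; nothing here is measured independently.
PARTITION_ROWS = {
    "2013-09-18": 26_733_990,
    "2013-09-19": 191_634_067,
    "2013-09-20": 63_790_065,
}
MAP_MINUTES = 10
REDUCE_MINUTES = 13
LOAD_MINUTES = 3 * 60 + 27  # 3 hours 27 minutes

NUM_PARTITIONS = 435
NUM_BUCKETS = 64


def load_ratio(load_minutes, map_minutes, reduce_minutes):
    """How many times longer the load step is than map + reduce combined."""
    return load_minutes / (map_minutes + reduce_minutes)


def max_output_files(num_partitions, num_buckets):
    """Upper bound on files moved during the load.

    ASSUMPTION (hypothetical, not stated in the thread): each partition
    ends up with at most one file per bucket.
    """
    return num_partitions * num_buckets


def row_shares(partition_rows):
    """Fraction of rows per partition, relative to the listed partitions only."""
    total = sum(partition_rows.values())
    return {name: rows / total for name, rows in partition_rows.items()}


if __name__ == "__main__":
    print("load / (map + reduce):",
          load_ratio(LOAD_MINUTES, MAP_MINUTES, REDUCE_MINUTES))
    print("file-count upper bound:",
          max_output_files(NUM_PARTITIONS, NUM_BUCKETS))
    for name, share in row_shares(PARTITION_ROWS).items():
        print(f"{name}: {share:.1%} of the rows in the three large partitions")
```

With the numbers as quoted, the ratio works out to 9 and the file-count bound to 27,840. If the one-file-per-bucket assumption holds, most of those files would be tiny, since 432 of the partitions hold under 1% of the data.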
