Block Sampling Impact

Ladda, Anand Fri, 15 Jun 2012 14:18:04 -0700

Hi,
I was trying block sampling on a table of 6 million rows (~400 MB) and can see
that if I sample about 1 percent of the data, queries respond about 3x faster
(I can also see a difference in the data returned). The table's input format,
though, is 'org.apache.hadoop.mapred.TextInputFormat' and not
CombineHiveInputFormat as mentioned in the Block Sampling documentation.
A question for the experts: is block sampling expected to work with other
input formats as well?
Thanks,
Anand



hive> desc formatted orderdetail2;
OK
# col_name              data_type               comment

order_id                int                     None
item_id                 int                     None
order_date              string                  None
emp_id                  int                     None
promotion_id            int                     None
qty_sold                float                   None
unit_price              float                   None
unit_cost               float                   None
discount                float                   None
customer_id             int                     None

# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Fri Jun 15 16:51:44 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               --
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1339793622

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.124 seconds
hive>
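
For context, here is a minimal sketch of the kind of comparison described in the message, assuming the TABLESAMPLE(n PERCENT) syntax from the Block Sampling documentation. The alias s and the count(*) aggregate are illustrative assumptions, not the queries that were actually run. Note that hive.input.format is a session-level setting, separate from the per-table InputFormat reported by desc formatted, so its value may be worth checking too.

```sql
-- Print the session-level input format
-- (distinct from the table's InputFormat shown by desc formatted).
set hive.input.format;

-- Illustrative only: sample roughly 1 percent of the table's data by block.
SELECT count(*) FROM orderdetail2 TABLESAMPLE(1 PERCENT) s;

-- The same aggregate over the full table, for comparing response time.
SELECT count(*) FROM orderdetail2;
```

Because block sampling works at block granularity, the fraction of rows actually read may differ from the requested percentage, especially on a table that spans only a few blocks.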
