Hi,
I was trying block sampling on a table with 6 million rows (~400 MB). When I
sample about 1 percent of the data, queries respond about 3x faster, and I can
also see a difference in the data returned. However, the table's input format
is 'org.apache.hadoop.mapred.TextInputFormat', not CombineHiveInputFormat as
mentioned in the Block Sampling documentation.
A question for the experts: is block sampling expected to work with other
input formats as well?
Thanks
Anand
hive> desc formatted orderdetail2;
OK
# col_name data_type comment
order_id int None
item_id int None
order_date string None
emp_id int None
promotion_id int None
qty_sold float None
unit_price float None
unit_cost float None
discount float None
customer_id int None
# Detailed Table Information
Database: default
Owner: hdfs
CreateTime: Fri Jun 15 16:51:44 EDT 2012
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: --
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1339793622
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.124 seconds
hive>
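For reference, a minimal sketch of the kind of sampled query discussed above, assuming Hive's documented TABLESAMPLE(n PERCENT) syntax. The COUNT(*) aggregate and the alias s are illustrative and not necessarily the exact query that was run. The first statement prints the session-level hive.input.format setting, which is configured separately from the table's storage InputFormat listed above and may be what the documentation's mention of CombineHiveInputFormat refers to.

```sql
-- Print the session-level input format; this is a session setting,
-- not the InputFormat stored in the table's metadata.
SET hive.input.format;

-- Block sampling: read roughly 1 percent of the table's data by size.
-- The alias "s" and the COUNT(*) aggregate are illustrative assumptions.
SELECT COUNT(*)
FROM orderdetail2 TABLESAMPLE(1 PERCENT) s;
```

Comparing the time and the count against the same query without the TABLESAMPLE clause is one way to confirm that sampling actually took effect.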