Hi,

I was trying block sampling on a 6 million row table (~400 MB) and can see that if I sample about 1 percent of the data, the queries respond about 3x faster (I can also see a difference in the data returned). The table's input format, though, is 'org.apache.hadoop.mapred.TextInputFormat' and not CombineHiveInputFormat as mentioned in the Block Sampling documentation.

Question for the experts: is block sampling expected to work with other input formats as well?

Thanks,
Anand
hive> desc formatted orderdetail2;
OK
# col_name              data_type               comment

order_id                int                     None
item_id                 int                     None
order_date              string                  None
emp_id                  int                     None
promotion_id            int                     None
qty_sold                float                   None
unit_price              float                   None
unit_cost               float                   None
discount                float                   None
customer_id             int                     None

# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Fri Jun 15 16:51:44 EDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               --
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1339793622

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1

Time taken: 0.124 seconds
hive>
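For reference, here is a sketch of the kind of session involved. The count(*) aggregate and the alias s are illustrative only, not the exact queries that were timed. As far as I understand, hive.input.format is a session-level setting that is separate from the table's own InputFormat shown above, so it may be worth checking both; treat that as an assumption to confirm rather than an answer.

```sql
-- Print the session-level input format Hive uses when building jobs.
-- Assumption: this is distinct from the table's InputFormat listed under
-- "# Storage Information" in the desc formatted output.
set hive.input.format;

-- Explicitly select CombineHiveInputFormat for this session, as the
-- Block Sampling documentation describes.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Block sampling: read roughly 1 percent of the table's data.
-- The aggregate and the alias "s" are illustrative.
SELECT count(*) FROM orderdetail2 TABLESAMPLE(1 PERCENT) s;

-- The same aggregate over the full table, for a timing comparison.
SELECT count(*) FROM orderdetail2;
```

Comparing the two counts and their run times is one way to see whether the sampled query really reads a smaller share of the input.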