Gopal V created HIVE-4488:
-----------------------------

             Summary: BucketizedHiveInputFormat is pessimistic with SMB split generation
                 Key: HIVE-4488
                 URL: https://issues.apache.org/jira/browse/HIVE-4488
             Project: Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.12.0
         Environment: Ubuntu LXC
            Reporter: Gopal V

BucketizedHiveInputFormat generates fewer splits than possible when both tables in the join are partitioned.

When debugging query82 from the TPC-DS spec, there were 7 partitions in the lhs (store_sales) & 8 partitions in the rhs (inventory), with 1 bucket each. Only 7 splits are generated for the mappers, instead of a potential 56.

{code}
13/05/01 07:08:22 INFO mapred.FileInputFormat: Total input paths to process : 1
13/05/01 07:08:22 INFO io.BucketizedHiveInputFormat: 7 bucketized splits generated from 344 original splits.
{code}

The loop that generates the splits is as follows

{code}
InputSplit[] iss = inputFormat.getSplits(newjob, 0);
if (iss != null && iss.length > 0) {
  numOrigSplits += iss.length;
  result.add(new BucketizedHiveInputSplit(iss, inputFormatClass.getName()));
}
{code}

As is clear from the above, the more granular (per-file/per-partition) splits coming off getSplits() are all being added to a single bucket split.

Logically, in our mapper we get

{code}
store_sales(2003)/000000_1 join MergeQueue(
  inv(1998-01-01)/000000_0
  inv(1998-01-08)/000000_0
  inv(1998-01-15)/000000_0
  inv(1998-01-22)/000000_0
  inv(1998-01-29)/000000_0
  inv(1998-02-05)/000000_0
  inv(1998-02-12)/000000_0
  inv(1998-02-19)/000000_0
  inv(1998-02-26)/000000_0
)
{code}

Ideally, we could have used a CombineFileInputFormat to get node locality for the merge queue inputs (viz. BucketizedHiveInputSplit). This would be far better at generating splits and at getting more out of short-circuit reads.
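
The 7 versus 56 figures quoted above can be reproduced with a small standalone sketch. It does not use any Hive or Hadoop classes: splits are modelled as plain strings, and the class `SplitCountSketch`, its methods, and the partition names are all hypothetical, introduced only for illustration. `groupPerPath` mirrors the shape of the quoted loop (one grouped split per input path), while `pairPartitions` only counts lhs/rhs partition pairs; whether each such pair is a valid unit of work for an SMB join is an assumption this sketch does not check.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, no Hive/Hadoop dependencies.
public class SplitCountSketch {

    // Mirrors the quoted loop: all original splits returned for one input
    // path are collapsed into a single grouped split.
    public static List<List<String>> groupPerPath(List<List<String>> originalSplitsPerPath) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> iss : originalSplitsPerPath) {
            if (iss != null && !iss.isEmpty()) {
                result.add(new ArrayList<>(iss));
            }
        }
        return result;
    }

    // Assumed finer-grained alternative: one unit of work per
    // (lhs partition, rhs partition) pair.
    public static List<String> pairPartitions(List<String> lhs, List<String> rhs) {
        List<String> pairs = new ArrayList<>();
        for (String l : lhs) {
            for (String r : rhs) {
                pairs.add(l + " join " + r);
            }
        }
        return pairs;
    }

    // Builds n placeholder partition names such as "store_sales/p0".
    public static List<String> names(String prefix, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(prefix + i);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lhs = names("store_sales/p", 7);
        List<String> rhs = names("inventory/p", 8);

        // One input path per lhs partition, each with a single bucket file.
        List<List<String>> perPath = new ArrayList<>();
        for (String l : lhs) {
            List<String> iss = new ArrayList<>();
            iss.add(l + "/000000_0");
            perPath.add(iss);
        }

        System.out.println(groupPerPath(perPath).size());     // prints 7
        System.out.println(pairPartitions(lhs, rhs).size());  // prints 56
    }
}
```

The gap between the two counts is the parallelism the report says is being left unused; the sketch says nothing about locality, which is what the CombineFileInputFormat suggestion is aimed at.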