[ 
https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838959#action_12838959
 ] 

Namit Jain commented on HIVE-1197:
----------------------------------

Currently, the split that a mapper processes is determined by a variety of 
parameters, including the dfs block size, min split size etc.

It might be useful to have an option when the users wants a mapper so scan 1 
file. This will be specially useful for sort-merge join.
If the data is partitioned into various buckets, and each bucket us sorted, the 
sort merge join can join the different buckets together.

For example, consider the following scenario:

table T1: sorted and bucketed by column 'key' into 1000 buckets
table T2: sorted and bucketed by column 'key' into 1000 buckets


and the query:


select * from T1 join T2 on key
mapjoin.

Instead of joining the table T1 with T2, the 1000 buckets can be joined with 
each other individually.
Since the data is sorted on the join key, sort-merge join can be used.
Say the buckets are named: b0001, b0002 .. b1000
Say table T1 is the big table, and the buckets from T2 are being read as part 
of the mapper which is spawned to process T1,
under the current approach, it will be very difficult to perform outer joins.

For example, if bucket b1 for T1 contains:


1
2
5
6
9
16
22
30

and the corresponding bucket for T2 contains:

2
4
8


If there are 2 mappers for bucket b1 for T1, processing 4 records each 
((1,2,5,6) and (9.16.22.30) respectively.
It will be very difficult to perform a outer join. The mapper will need to peek 
into the previous record
and the next record respectively.

Moreover, it will be very difficult to ensure that the result also has 1000 
buckets. Another map-reduce job
will be needed for the same.

This can be easily solved if we are guaranteed that the whole bucket (or the 
file corresponding to the bucket),
will be processed by a single mapper.
 

> create a new input format where a mapper spans a file
> -----------------------------------------------------
>
>                 Key: HIVE-1197
>                 URL: https://issues.apache.org/jira/browse/HIVE-1197
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>             Fix For: 0.6.0
>
>
> This will be needed for Sort merge joins.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to