[jira] Commented: (HIVE-1900) a mapper should be able to span multiple partitions

Ning Zhang (JIRA) Fri, 07 Jan 2011 11:00:09 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978918#action_12978918
 ]


Ning Zhang commented on HIVE-1900:
----------------------------------

I remember running into this problem before. Enabling a mapper to read from 
multiple partitions is trivial, but there are some pitfalls to watch for:

 1) Partitioning columns are not present in the data file itself. The 
partitioning column value is appended to each row at read time (in the 
RecordReader or something like that), and that code assumes that all records 
come from the same partition. That assumption is broken once a mapper spans 
multiple partitions. An example query you can try is 

   select ds, count(1) from srcpart where ds is not null group by ds;

 2) The merge job should be treated specially to not allow combined input from 
multiple partitions. 

 3) Auto-gathering of stats in the FileSinkOperator needs to be addressed for 
this case so that stats are maintained for multiple partitions. 
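
To make pitfall 1 concrete, here is a minimal sketch in Python rather than 
Hive's actual Java code. It is a toy model, not Hive's RecordReader: the file 
paths, row values, and function names are all hypothetical. The only thing 
taken from Hive is the key=value partition directory layout. It shows how 
resolving the partition value once per split miscounts rows when the split 
combines files from two partitions, and how resolving it per file avoids that.

```python
from collections import Counter

# Hypothetical stand-in for two partition directories of srcpart.
# Rows hold only non-partition data; the ds value exists only in the path.
FILES = {
    "srcpart/ds=2008-04-08/part-0": ["a", "b"],
    "srcpart/ds=2008-04-09/part-0": ["c", "d", "e"],
}


def partition_value(path, column="ds"):
    """Return the value of a partition column from a key=value path component."""
    for component in path.split("/"):
        key, sep, value = component.partition("=")
        if sep and key == column:
            return value
    raise ValueError(f"no partition column {column!r} in {path!r}")


def read_split_once(paths):
    """Resolve ds once per split: the single-partition assumption."""
    ds = partition_value(paths[0])
    for path in paths:
        for row in FILES[path]:
            yield ds, row


def read_split_per_file(paths):
    """Resolve ds for every file, so a split may span partitions."""
    for path in paths:
        ds = partition_value(path)
        for row in FILES[path]:
            yield ds, row


def count_by_ds(records):
    """Toy equivalent of: select ds, count(1) ... group by ds."""
    return dict(Counter(ds for ds, _ in records))


# One combined split covering files from both partitions.
combined_split = sorted(FILES)
wrong = count_by_ds(read_split_once(combined_split))
right = count_by_ds(read_split_per_file(combined_split))
print("once per split:", wrong)
print("once per file: ", right)
```

In this model the first reader attributes all five rows to the first 
partition, which is the kind of wrong answer the group-by query above would 
expose.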

> a mapper should be able to span multiple partitions
> ---------------------------------------------------
>
>                 Key: HIVE-1900
>                 URL: https://issues.apache.org/jira/browse/HIVE-1900
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
>
> Currently, a mapper only spans a single partition, which creates a problem 
> in the presence of many small partitions (which is becoming a common use 
> case at Facebook).
> If the plan is the same, a mapper should be able to span files across 
> multiple partitions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
